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Description 

[0001] This application Is a continuing application of U.S.S.N. 60/043,464, filed April 11, 1997, 60/054,678, filed 
August 4, 1997, and 60/061,097, filed Octobers, 1997. 

FIELD OF THE INVENTION 

[0002] Tile present invention relates to an apparatus and method for quantitative protein design and optimization. 
BACKGROUND OF THE INVENTION 

[0003] De novo protein design has received considerable attention recently, and significant advances have been 
made toward the goal of producing stable, well-folded proteins with novel sequences. Efforts to design proteins rely 
on knowledge of the physical properties that determine protein structure, such as the patterns of hydrophobic and 
hydrophlllc residues in the sequence, salt bridges and hydrogen bonds, and secondary stmcturai preferences of amino 
acids. Various approaches to apply these principles have been attempted. For example, the construction of a-helical 
and p-sheet proteins with native-like sequences was attempted by Individually selecting the residue required at every 
position in the target fold (Hecht, et aL, Science 249:884-891 (1 990); Quinn, et aL, Proc. Nati. Acad. Sci USA 91 : 
8747-8751 (1994)). Alternatively, a minimalist approach was used to design helical proteins, where the simplest pos- 
sible sequence believed to be consistent with the folded structure was generated (Regan, ef a/.. Science 241 :976-978 
(1 988); DeGrado, et at., Science 243:622-628 (1 989); Handel, et aL, Science 261 :879-885 (1 993)), with varying de- 
grees of success. An experimental method that relies on the hydrophobic and polar (HP) pattern of a sequence was 
developed where a library of sequences with the correct pattern for a four helix bundle was generated by random 
mutagenesis (Kamtekar, etaL, Science 262: 1680-1 685 (1993)). Among non de novo approaches, domains of naturally 
occurring proteins have been modified or coupled together to achieve a desired tertiary organization (PessI, et a/., 
Nature 362:367-369 (1993); Pomerantz, et at,. Science 267:93-96 (1995)). 

[0004] Though the correct secondary structure and overall tertiary organization seem to have been attained by sev- 
eral of the above techniques, many designed proteins appear to lack the stnjctural specificity of native proteins. The 
complementary geometric arrangement of amino acids in the folded protein is the root of this specificity and is encoded 
in the sequence. 

[0005] Several groups have applied and experimentally tested systematic, quantitative methods to protein design 
with the goal of developing general design algorithms (Hellinga, etal., J. Moi. Biol. 222: 763-785 (1991 ); Hurley ef a/,, 
J. Mol. Biol. 224:1143-1154 (1992); DesjarlaisI, ef a/., Protein Science 4:2006-2018 (1995); Harbury, ef a/., Proc. Natl. 
Acad. Sci. USA 92:8408-8412 (1996); Klemba, ef a/., Nat. Struc. Biol. 2:368-373(1995); Nautiyal, ef a/.. Biochemistry 
34:11645-11651 (1995); Betzo, ef a/., Biochemistry 35:6955-6962 (1996); Dahiyat, ef a/.. Protein Science 5:895-903 
(1 996); Jones, Protein Science 3:567-574 (1994); Konoi, etat., Proteins: Structure. Function and Genetics 19:244-255 
(1 994)). These algorithms consider the spatial positioning and steric complementarity of side chains by explicitly mod- 
eling the atoms of sequences under consideration. To date, such techniques have typically focused on designing the 
cores of proteins and have scored sequences with van der Waals and sometimes hydrophobic solvation potentials. 
[0006] In addition, the qualitative nature of many design approaches has hampered the development of improved, 
second generation, proteins because there are no objective methods for learning from past design successes and 
failures. 

[0007] Thus, it is an object of the Invention to provide computational protein design and optimization via an objective, 
quantitative design technique implemented in connection with a general purpose computer. 

SUI^MARY OF THE INVENTION 

[0008] In accordance with the objects outlined above, the present invention provides methods executed by a com- 
puter under the control of a program, the computer including a memory for storing the program. The method comprising 
the steps of receiving a protein baclcbone structure with variable residue positions, establishing a group of potential 
rotamers for each of the variable residue positions, wherein at least one variable residue position has rotamers from 
at least two different amino acid side chains, and analyzing the Interaction of each of the rotamers with all or part of 
the remainder of the protein backbone structure to generate a set of optimized protein sequences. The methods further 
comprise classifying each variable residue position as either a core, surface or boundary residue. The analyzing step 
may include a Dead-End Elimination (DEE) computation. Generally, the analyzing step Includes the use of at least one 
scoring function selected from the group consisting of a Van der Waals potential scoring function, a hydrogen bond 
potential scoring function, an atomic solvation scoring function, a secondary structure propensity scoring function and 
an electrostatic scoring function. The methods may further comprise generating a rank ordered list of additional optimal 
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sequences from the globally optfmal protein sequence. Some or all of the protein sequences from the ordered list may 
be tested to produce potential energy test results. 

10009] In an additional aspect, the invention provides nucleic acid sequences encoding a protein sequence generated 
by the present methods, and expression vectors and host cells containing the nucleic acids. 
[00101 In a further aspect, the invention provides a computer readable memory to direct a computer to function in a 
specified manner, comprising a side chain module to correlate a group of potential rotamers for residue positions of a 
protein backbone model, and a ranking module to analyze the interacUon of each of said rotamers with all or part of 
the remainder of said protein to generate a set of optimized protein sequences. The memory may further comprise an 

assessment module to assess the correspondence between potential energy test results and theoretical potential 
energy data. 

BRIEF DESCRIPTION OF THE DRAWINGS 
[0011] 

Figure 1 illustrates a general purpose computer configured In accordance with an embodiment of the invemion. 
Rgure 2 illustrates processing steps associated with an embodiment of the invention. 

Rgure 3 fliustrates processing steps associated with a ranking module used in accordance with an embodiment 
of the invention. After any DEE step, any one of the previous DEE steps may be repeated. In addition/any one of 
the DEE steps may be eliminated; for example, original singles DEE (step 74) need not be run. 

Rgure 4 depicts the protein design automation cycle. 

Figure 5 depicts the helical wheel diagram of a coiled coM. One heptad repeat Is shown viewed down the major 
axes of the helices. The a and d positions define the solvent-inaccessible core of the molecule (Cohen & Panv 
1990, Proteins, Structure, Function and Genetics 7:1 -15). 

Figures 6A and 6B depict the comparison of simulation cost functions to experimental Tm's. Rgure 6A depicts the 
initial cost function, which contains only a van der Waals term for the eight PDA peptides. Rgure 68 depicts the 
improved cost functton containing polar and nonpolar surface area terms weighted by atomic solvation parameter 
denved from QSAR analysis; 1 6 cal/mol/A2 favors hydrc^hobic surface burial. 

Rgure 7 shows the rank correlation of energy predicted by the simulation module versus the combined activity 

^ ^* l^"'- Bio'- 219:359-376 (1 991 ); Heilinga, et al., Proc. fMatl. Acad . Sci 
USA 91 :5803-5807 (1 994)). : 

Figure 8 shows the sequence of pdaSd aligned with the second zinc finger of Zlf268. The boxed ositions were 
designed usingthe sequence selection algorithm. The coordinates of PDB record 1 zaa (Paveietch etal. Science 
252:809-817 (1991)) from residues 33-60 were used as the structure template. In our numbering,'positio';ril5r 
responds to 1 zaa position 33. 

Rgures 9A and 98 shows the NIMR spectra and soiulion secondary structure of pda8d from Example 3 Figure 9A 
Is the TOOSY Ha-HN fingerprint region of pdaSd. Figure 9B is the NMR NOE connectivities of pdaSd. Bars rep- 
resent unambiguous connectivities and the bar thtekness of the sequential connections is indexed to the intensity 
of the resonance. 

Figures 10A and 108 depict the secondary structure content and themial stability of a90, a85, a70 and o107 
Figure 10A depicts the farUV spectra (circular dichroism). Figure 10B depicts the thermal denaturation monitored 

Rgure 11 epicts the sequence of FSD-1 of Example 5 aligned with the second zinc finger of Zif268. The bar at the 
top of the figure shows the residue position classifications: solid bars Indicate core positions, hatched bars indicate 
boundary positions and open bars indicate surface positions. The alignment matches positions of FSD-1 to the 

the six identical positions (21%) between FSD-1 and 
af268. four are buned (Ile7, Phe12, Leu18 and Ile22). The zinc binding residues of Zif268 are boxed. Represent- 
ative non-optimal sequence solutions detemilned using a Monte Carlo simulated annealing protocol are shown 
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With their rank. Vertical lines indicate identity with FSD-1 . The symbols at the bottom the figure show the degee of 
sequence conservation for each residue position computed across the top 1000 sequences: filled circles indicate 
greater than 99% conservation, half-filled circles indicate conservation between 90 and 99%, open circles indicate 
conservation between 50 and 90%, and the absence of symbol indicates less than 50% conservation. The con- 
sensus sequence determined by choosing the amino acid with the highest occurrence at each position is identical 
to the sequence of FSD-1 . 

Figure 1 2 is a schematic representation of the minimum and maximum quantities (defined in Eq 24 to 27) that are 
used to construct speed enhancements. The minima and maxima are utilized directly to find the /iJJ„. pair and 
for the comparison of extreme. The differences between the quantities, denoted with arrows, are used to construct 
the and q^^ metrics. 

Figures 13A, 13B, 13C, 13D. 13E and 13F depicts the areas involved in calculating the buried and exposed areas 
11"^ "'^^ ®- ^^^^ P™*®'" template, the heavy solid lines correspond to three rotamers 

a mree different residue positions, and the lighter solid lines correspond to surface areas, a) A o, -^for each rotamer. 
b) /»„for each rotamer c) (A^i^ - A, J summed over the three residues. The upper residue does not bury any area 
agairist the template except that buried in the tri-peptlde state A o,^ d) >4, .^for one pair of rotamere. e) The area 
buned between rotamers. (A,^Aj^-AyJ. for the same pair of rotamers al in (d). f) The area buried between ro- 
tamers, (A,^+Aj^A,jJ) summed over the three pairs of rotamers. The area b intersected by all three rotamers is 
counted twice and Is indicated by the double lines. The buried area calculated by Equation 18 is the area buried 
by the template, represented in (c), plus s times the area buried between rotamers. represented In (f). The scalind 
factor s accounts for the over-counting shown by the double lines in (f). The exposed area calculated by Equation 
1 9 IS the exposed are In the presence of the template, represented in (b). minus s times the area buried between 
rotamers, represented in (f). 

DETAILED DESCRIPTION OF THE INVENTION 

[0012] The present invention is directed to the quantitative design and optimization of amino acid sequences using 
an inverse protein folding" approach, which seeks the optimal sequence for a desired structure. Inverse folding is 
similar to protein design, which seeks to find a sequence or set of sequences that will fold Into a desired stmcture 
These approaches can be contrasted with a "protein folding" approach which attempts to predict a stnjcture taken by 
a given sequence. ' 

[00131 The general preferred approach of the present inventton is as follows, although alternate embodiments are 
discussed below. A known protein structure Is used as the starting point The residues to be optimized are then iden- 
tffied, which may be the entire sequence or subset(s) thereof. The side chains of any positions to be varied are then 
removed. The resulting structure consisting of theprotein backbone andthe remaining sidechalns Is called the template 
Each vanable residue position Is then preferably classified as a core residue, a surface residue, or a boundary residue- 
each classification defines a subset of possible amino acid residues for the position (for example, core residues gen- 
erally will be selected frorti the set of hydrophobic residues, surface residues generally will be selected from the hv- 
drophihc residues, and boundary residues may be either). Each amino add can be represented by a discrete set of all 
allowed confonners of each side chain, called rotamers. Thus, to anrlve at an optimal sequence for a backbone all 
possible sequences of rotamers must be screened, where each backbone position can be occupied either by ekch 
ammo acid in all its possible rotameric states, or a subset of amino acids, and thus a subset of rotamers 
[0014] Two sets of interactions are then calculated for each rotamer at every position: the interaction of the rotamer 
side chain with all or part of the backbone (the "singles" energy, also called the rotamer/template or lotamer/backbone 
energy), and the interaction of the rotamer side chain with all other possible rotamers at every other position or a subset 
of the other positions (the "doubles" energy, also called the rotamer/rotamer energy). The energy of each of these 
interactions is calculated through the use of a variety of scoring functions, which include the energy of van der Waal's 
forces, the energy of hydrogen bonding, the energy of secondary structure propensity, the energy of surface area 
solvation and the electrostattes. Thus, the total energy of each rotamer interaction, both with the backbone and other 
rotamers. Is cateulated, and stored in a matrix forni. 

[OOiq The discrete nature of rotamer sets allows a simple calculation of the number of rotamer sequences to be' 
tested. A backbone of length n with m possible rotamers per position wifl have m" possible rotamer sequences a 
nurnber which grows exponentlalV with sequence length and renders the calculations either unwieldy or impossibte In 
real time. Accordingly, to solve this combinatorial search problem, a "Dead End Elimination" (DEE) cateulation Is per- 
fomied. The DEE calculation is based on the fact that If the worst total Interaction of a first rotamer Is still better than 
the best total Interaction of a second rotamer, then the second rotamer cannot be part of the global optimum solution 
Since the energies of all rotamers have already been calculated, the DEE approach only requires sums over the se^ 
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quence length to test and eliminate rotamers, which speeds up the calculations considerably. DEE can be renin com- 
paring pairs of rotamers. or combinations of rotamets. which will eventually result in the determination of a single 
sequence which represents the global optimum energy. 

[0016J Once the global solution has been found, a Monte Carlo search may be done to generate a rank-ordered list 

0 sequences in the neighborhood of the DEE solution. Starting at the DEE solution, random positions are changed to 

other rotamers, and the new sequence energy is calculated. If the new sequence meets the criteria for acceptance it 

8 used as a starting point for another jump. After a predetermined number of jumps, a ranlc-ordered list of sequences 
is generated. 

[0017] The results may then be experimentally verified by physically generating one or more of the protein sequences 
followed by experimental testing. The information obtained from the testing can then be fed back into the analysis to 
modify the procedure if necessary. 

[0018] Thus, the present invention provides a computer-assisted method of designing a protein. The method com- 
pnses providing a protein backbone structure with variable residue positions, and then establishing a group of potential 
rotamers for each of the residue positions. As used herein, the backbone, or template, includes the backbone atoms 
and any fixed side chains. The Interactions between the protein backbone and the potential rotamers, and between 
pairs of the potential rotamers, are then processed to generate a set of optimized protein sequences, preferably a 
single global optimum, which then may be used to generate other related sequences. 

[0019] Figure 1 illustrates an automated protein design apparatus 20 in accordance with an embodiment of the 
Invention. The apparatus 20 includes a central processing unit 22 which communicates with a memory 24 and a set 
of input/output devices (e.g., keyboard, mouse, monitor, printer, etc.) 26 through a bus 28. The general interaction 
between a central processing unit 22. a memory 24. input/output devices 26. and a bus 28 is known in the art TTie 
present invention is directed toward the automated protein design program 30 stored in the memory 24. 
[0020] The automated protein design program 30 may be implemented with a side chain module 32 As discussed 
in detail below, the side chain module establishes a group of potential rotamers for a selected protein backbone struc- 
ture. The protein design program 30 may also be implemented with a ranking module 34. As discussed in detail below 
the ranking module 34 analyzes the Interaction of rotamers with the protein backbone stmcture to generate optimized 
protein sequences. The protein design program 30 may also include a search module 36 to execute a search for 
example a Monte Cario search as described below, in relation to the optimized protein sequences. Finally an assess- 
ment module 38 may also be used to assess physical parameters associated with the derived proteins, as discussed 
further below. 

[0021] The memory 24 also stores a protein backbone structure 40, which is downloaded by a userthrough the input/ 
output devices 26. The memory 24 also stores information on potential rotamers derived by the skJe chain module 32 
In addition, the memory 24 stores protein sequences 44 generated by the ranking module 34. The protein sequences 
44 may be passed as output to the input/output devices 26. 

[0022] The operation of the automated protein design apparatus 20 is more fully appreciated with reference to Fig 
2. Fig. 2 Illustrates processing steps executed in accordance with the method of the invention. As described below 
many of the processing steps are executed by the protein design program 30. The first processing step illustrated iri 
Fig. 2 IS to provide a protein backbone stmcture (step 50). As previously Indicated, the protein backbone structure is 
downtoaded through the input/output devices 26 using standard techniques. 

[0023] The protein backbone structure corresponds to a selected protein. % "protein" herein is meant at least two 
ammo acids linked together by a peptide bond. As used herein, protein includes proteins, oligopeptides and peptides 
The peptidyl group may comprise naturally occuning amino acids and peptide bonds, or synthetic peptldomimetic 
structures, i.e. 'analogs", such as peptoids (see Simon et ai, PNAS USA 89(20):9367 (1 992)).. The amino acids may 
either be naturally occuring or non-naturally occuring; as will be appreciated by those in the art, any structure for which 
a set of rotamers is known or can be generated can be used as an amino acid. The side chains may be In either the 
(R) or the (S) configuration. In a preferred embodiment, the amino acids are in the (S) or L-configuration 
[0024] The chosen protein may be any protein for which a three dimensional structure is known or can be generated- 
that IS. for which there are three dimensional coordinates for each atom of the protein. Generally this can be determined 
using X-ray crystallographic techniques, NMR techniques, de novo modelling, homology modelling, etc. In general if 
X-ray structures are used, stmctures at 2k resolution or better are preferred, but not required. 
[0025] The proteins may be from any organism, including prokaiyotes and eukaryotes, with enzymes from bacteria 
fungi, extremeophiles such as the archebacteria, insects, fish, animais (particulariy mammals and particularly human) 
and birds all possible. 

[0026) Suitable proteins include, but are not limited to, industrial and pharmaceutical proteins, including Uganda cell 
surface receptors, antigens, antibodies, cytokines, honmones, and enzymes. Suitable classes of enzymes include, but 
are not limited to. hydrolases such as proteases, carbohydrases, lipases; isomerasessuch as racemases, epimerases, 
tautomerases, or mutases; transferases, kinases, oxidoreductases, and phophatases. Suitable enzymes are listed in 
the Swiss-Prot enzyme database. 
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[0027] Suitable protein backbones Include, but are not limited to, all of those found in the protein data base compiled 
and sen/iced by the Brool<haven National Lab. 

[0028] Specifically Included within "protein" are fragments and domains of known proteins, Including functional do- 
mains such as enzymatic domains, binding domains, etc., and smaller fragments, such as turns, loops, etc. That is, 
portions of proteins may be used as well. 

[0029] Once the protein Is chosen, the protein backbone stnjcture is Input into the computer. By "protein backbone 
structure" or grammatteal equivalents herein Is meant the three dimensional coordinates that define the three dimen- 
sional structure of a particular protein. The structures which comprise a protein backbone structure (of a naturally 
occuring protein) are the nitrogen, the carbonyl carbon, the a-carbon. and the carbonyl oxygen, along with the direction 
of the vector from the a-carbon to the p-carbon. 

[0030J The protein backbone structure which is input Into the computer can either Include the coordinates for both 
the backbone and the amino add side chains, or just the backbone, I.e. with the coordinates for the amino acid side 
chains removed. If the former is done, the side chain atoms of each amino acid of the protein structure may be "stripped" 
or removed from the structure of a protein, as Is known In the art, leaving only the coordinates for the "backbone" atoms 
(the nitrogen, carbonyl carbon and oxygen, and the a-carbon, and the hydrogens attached to the nitrogen and a- 
carbon). 

[0031] After inputing the protein structure backbone, explicit hydrogens are added if not Included within the structure 
(for example, if the structure was generated by X-ray crystallography, hydrogens must be added). After hydrogen 
addition, energy minimization of the structure is run, to relax the hydrogens as well as the other atoms, bond angles 
and bond lengths. In a preferred embodiment, this is done by doing a number of steps of conjugate gradient minimization 
(Mayo et al„ J. Phys. Chem. 94:8897 (1990)) of atomic coordinate positions to minimize the Drelding force field with 
no electrostatics. Generally from about 10 to about 260 steps is preferred, with about 50 being most preferred. 
[0032] The protein backbone structure contains at least one variable, residue position. As Is known in the art, the 
residues, or amino acids, of proteins are generally sequentially numbered starting with the N-terminus of the protein. 
Thus a protein having a methionine at rs N-terminus is said to have a methionine at residue or amino acid position 1 , 
with the next residues as 2, 3, 4, etc. At each position, the wild type (i.e. naturally occuring) protein may have one of 
at least 20 amino acids, in any number of rotamers. By "variable residue position" herein is meant an amino acid 
position of the protein to be designed that is not fixed In the design method as a specific residue or rotamer, generally 
the wild-type residue or rotamer. 

[0033] In a preferred embodiment, all of the residue positions of the protein are variable. That is, every amino acid 
side chain may be altered in the methods of the present Invention. This is particularly desirable for smaller proteins, 
although the present methods allow the design of larger proteins as well. While there is no theoretical limit to the length 
of the protein which may be designed this way, there is a practical computational limit. 

[0034] in an alternate preferred embodiment, only some of the residue positions of the protein are variable, and the 
remainder are "fixed", that Is, they are Identified In the three dimensional structure as being in a set conformation. In 
some embodiments, a fixed position is left in Its original confoimation (which may or may not correlate to a specific 
rotamer of the rotamer library being used). Alternatively, residues may be fixed as a non^vlld type residue; for example, 
when known site-directed mutagenesis techniques have shown that a particular residue is desirable (for example, to 
eliminate a proteolytic site or alter the substrate specificity of an enzyme), the residue may be fixed as a particular 
amino acid. Alternatively, the methods of the present Invention may be used to evaluate mutations de novo, as is 
discussed below. In an alternate preferred embodiment, a fixed position may be "floated"; the amino acid at that position 
is fixed, but different rotamers of that amino acid are tested. In this embodiment, the variable residues may be at least 
one, or anywhere from 0.1 % to 99.9% of the total number of residues. Thus, for example, it may be possible to change 
only a few (or one) residues, or most of the residues, with ail possibilities in between. 

[0035] In a prefenred embodiment, residues which can be fixed Include, but are not limited to, structurally or biolog- 
ically functional residues. For example, residues which are known to be important for biological activity, such as the 
residues which fonn the active site of an enzyme, the substrate binding site of an enzyme, the binding site for a binding 
partner (ligand/receptor, antigen/antibody, etc.), phosphorylation or glycosylatlon sites which are crucial to biological 
function, or stmcturally important residues, such as disulfide bridges, metal binding sites, critical hydrogen bonding 
residues, residues critical for backbone conformation such as proline or glycine, residues critical for packing interac- 
tions, etc. may all be fixed In a confonnation or as a single rotamer, or "floated". 

[0036] Similarly, residues which may be chosen as variable residues may be those that confer undesirable biological 
attributes, such as susceptibility to proteolytic degradation, dimerlzatlon or aggregation sites, glycosylation sites which 
may lead to immune responses, unwanted binding activity, unwanted allostery, undesirable enzyme activity but with a 
preservation of binding, etc. 

[0037] As will be appreciated by those in the art, the methods of the present invention allow computational testing 
of "site-directed mutagenesis" targets without actually making the mutants, or prior to making the mutants. That is, 
quick analysis of sequences in which a small number of residues are changed can be done to evaluate whether a 
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proposed change is desirable. In addition, this may be done on a l<nown protein, or on an protein optimized as described 
herein. 

[0038] As will be appreciated by those in the art, a domain of a larger protein may essentially be treated as a small 
independent protein; that is, a structural orfunctional domain of a large protein may have minimal interactions with the 
remainder of the protein and may essentially be treated as if it were autonomous. In this embodiment, all or part of the 
residues of the domain may be variable. 

[0039] it should be noted that even if a position is chosen as a variable position, it is possible that the methods of 
the invention will optimize the sequence in such a way as to select the wild type residue at the variable position This 
generally occurs more frequently for core residues, and less regularly for surface residues. In addition, it Is possible 
to fix residues as non-wild type amino acids as well. 

[0040] Once the protein backbone structure has been selected and input, and the variable residue positions chosen 
a group of potential rotamers for each of the variable residue positions is established. This operation is shown as step 
52 in Figure 2. This step may be implemented using the side chain module 32. In one embodiment of the Invention 
the side chain module 32 includes at least one rotamer library, as described below, and program code that correlates 
the selected protein baclcbone stmcture with conresponding infonnation in the rotamer library. Alternatively the side 
chain module 32 may be omitted and the potential rotamers 42 for the selected protein bacldsone structure may be 
downloaded through the input/output devices 26. 

[00411 As is known in the art, each amino acid side chain has a set of possible conformeis, called rotamers See 
Ponder, ef a/., Acad. Press Inc. (London) Ltd. pp. 775-791 (1 987); Dunbrack, et al., Struc. Biol. 1 (5):334-340 (1 994)- 
Desmet, ef a/.. Nature 356:539-542(1 992), all of which are hereby expressly incorporated by reference in their entlreity' 
Thus, a set of discrete rotamers for every amino ackJ side chain is used. There are two general types of rotamer 
libranes: backbone dependent and backbone independent. A backbone dependent rotamer library allows different 
rotamers depending on the position of the residue in the backbone; thus for example, certain leucine rotamere are 
altowed if the position is within an a helix, and different leucine rotamers are allowed if the position is not in a a-heiix 
A backbone independent rotamer libraiy utilizes ail rotamers of an amino acid at every position. In general, a backbone 
independent library is preferred in the consideration of core residues, since flexibility In the core is important. However 
backbone independent libraries are computationally more expensive, and thus for surface and boundary positions a 
backbone dependent library is prefen-ed. However, either type of library can be used at any position. 
[0042] In addition, a preferred embodiment does a type of "fine tuning" of the rotamer library by expanding the possible 
X (chi) angle values of the rotamers by plus and minus one standard deviation (or more) about the mean value in 
orderto minimize possible errors that might arise from the discreteness of the libraiy. This is partteularly important for 
aromatic residues, and fairly important for hydrophobic residues, due to the increased requirements for flexibility in the 
core and the rigidity of aromatic rings; it is not as important for the other residues. Thus a preferred embodiment 
expands the and X2 angles for all amino acids except Met, Arg and Lys. 

[0043] To roughly illustrate the numbers of rotamers, in one version of the Dunbrack & Karplus backbone-dependent 
rotamer library, alanine has 1 rotamer, glycine has 1 rotamer, arginine has 5S rotamers, threonine has 9 rotamers 
lysine has 57 rotamers, glutamic add has 69 rotamers, asparagine has 54 rotamers, aspartic acid has 27 rotamers' 
tryptophan has 54 rotamers, tyrosine has 36 rotamers, cysteine has 9 rotamers, glutamine has 69 rotamers, histidine 
has 54 rotamere, valine has 9 rotamers, isoleucine has 45 rotamers, leucine has 36 rotamers, methionine has 21 
rotamers, serine has 9 rotamers, and phenylalanine has 36 rotamers. 

[0044] In general, proline is not generally used, since it will rarely be chosen for any position, although It can be 
included if desired. Similarly, a preferred embodimentomlts cysteine as a consideration, only to avoid potential disulfide 
problems, although It can be included if desired. 

[0045] As will be appreciated by those in the art, other rotamer libraries with ail dihedral angles stamered can be 
used or generated. 

[0046] In a preferred embodiment, at a minimum, at least one variable position has rotamers from at least two different 
amino acid side chains; that Is, a sequence is being optimized, rather than a structure. 

[0047] In a preferred embodiment, rotamers from all of the amino acids (or all of them except cysteine, glycine and 
proline) are used for each variable residue position; that is, the group or set of potential rotamers at each variable 
^position is every possible rotamer of each amino acid. This Is especially preferred when the number of variable positions 
is not high as this type of analysis can be computationally expensive. 

[0048] In a prefen-ed embodiment, each variable position is classified as either a core, surface or boundary residue 
position, although in some cases, as explained below, the variable position may be set to glycine to minimize backbone 
strain. 

[0049] It should be understood that quantitative protein design or optimization studies prior to the present invention 
focused almost exclusively on core residues. The present invention, however, provides methods for designing proteins 
containing core, surface and boundary positions. Altemate embodiments utilize methods for designing proteins con- 
taining core and surface residues, core and boundary residues, and surface and boundaiy residues, as well as core 
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a^TO ^"^'"^ functions of the present invention), surface residues alone, or boundary residues 

[0050] The classification of residue positions as core, surface or boundary may be done in several ways, as will be 
appreciated by those In the art. In a preferred embodiment, the classification is done via a visual scan of the oriqinal 
protein backbone structure. Including the side chains, and assigning a classification based on a subjective evaluation 
of one stated in the art of protein modelling. Alternatively, a preferred embodiment utilizes an assessment of the ori- 
entation of the Ca-Cp vectors relative to a solvent accessible surface computed using only the template Ca atoms In 
a preferred embodiment, the solvent accessible surface for only the Ca atoms of the target fold Is generated using ihe 
Connolly algorithm with a probe radius ranging from about 4 to about 12A. with from about 6 to about lOA being 
preferred, and 8 A being particularly preferred. The Go radius used ranges from about 1 .sA to about 2.3A with fror? 
about 1 .8 to about2.1 A being preferred, and 1 .95 A being especially preferred. A residue is classified as a core position 
if a) the distance for its Co, along Its Ca-Cp vector, to the solvent accessible surface is greater than about 4-6 A with 
greater than about 5.0 A being especially preferred, and b) the distance for its Cp to the nearest surface point is greater 
than about 1.5-3 A. with greater than about 2.0 A being especially preferred. The remaining residues are classed as 
surface positions If the sum of the distances from their Ca, along their Ca-Cp vector, to the solvent accessible surface 
plus the distance from their Cp to the closest surface point was less than about 2.5-4 A, with less than about 2 7 A 
being especially preferred. All remaining residues are classified as boundary positions 

[0051] Once each variable position is classified as either core, surface or boundaiy, a set of amino acid side chains 
and thus a 88 of rotameis, Is assigned to each posWon. That is, the set of possible amino acid side chains that the 
program will allow to be considered at any particular position Is chosen. Subsequently, once the possible amino acid 
side chains are chosen, the set of rotamers that will be evaluated at a particular position can be detemiined. Thus a 
core residue will generally be selected from the group of hydrophobic residues consisting of alanine, valine, Isoleucliie 
leucine, phenylalanine, tyrosine, tryptophan, and methionine (in some embodiments, when the a scaling factor of the 
van der Waals scoring function, described below, is low, methionine is removed from the set), and the rotamer set for 
each core position potentially includes rotamers for these eight amino acid side chains (all the rotamers if a backbone 
independent library is used, and subsets if a rotamer dependent backbone is used). Similarly, surface positions are 
generally selected from the group of hydrophlllc residues consisting of alanine, serine, threonine, asparticacid aspar- 
agine, glutamme, glutamic acid, arginine, lysine and histidine. The rotamer set for each surface position thus Includes 
rotamers for these ten residues. Finally, boundaiy positions are generally chosen from alanine, serine, threonine, as- 
partic acid, asparagine. glutamine. glutamk: acid, arginine, lysine histidine, valine, isoleucine, leucine, phenylalanine 
tyrosine, tryptophan, and methionine. The rotamer set for each boundary position thus potentially includes every ro- 
temer for ttiese seventeen residues (assuming cysteine, glycine and proline are not used, although they can be) 
[0052] Thus as will be appreciated by those In the art. there Is a computational benefit to classifying the residue 
positions, as It decreases the number of calculations, it should also be noted that there may be situations where the 
sets of core, boundary and surface residues are altered from those described above; for example, under some circum- 
stances, one or more amino acids is either added or subtracted from the set of allowed amino acids. For example 
some proteins which dlmerlze or multimerize, or have ligand binding sites, may contain hydrophobte surface residues' 
eta In addrtion, residues that do not aik>w helix "capping" or the favorable interaction with an a-helix dipole may be 
m^«f« . of allowed residues. This modification of amino acid groups Is done on a residue by residue basis 

[0053] In a preferred embodiment, proline, cysteine and glycine are not included In the list of possible amino acid 
side chains, and thus the rotamers for these side chains are not used. However, in a preferred embodiment when the 
variable r^idue position has a <|> angle (that is. the dihedral angle defined by 1) the carbonyl carbon of the preceding 
amino acid; 2) the nitrogen atom of the current residue; 3) the a-carbon of the current residue; and 4) the carbonyl 
carbon of the current residue) greater than 0», the position is set to glycine to minimize backbone strain 
[0054] Once the group of potential rotamers is assigned for each variable residue position, processing proceeds to 
step 54 of Figure 2. This processing step entails analyzing interactions of the rotamers with each other and with the 
protein backbone to generate optimized protein sequences. The ranking module 34 may be used to perform these 
operations. TTiat is. computercode is written to implement the following functions. Simpllstnally. as Is generally outlined 
above, the processing initially comprises the use of a number of scoring functions, described below, to calculate en- 
ergies of interactions of the rotamers, either to the backbone itself or other rotamers 

[0055] The scoring functions Include a Van der Waals potential scoring function, a hydrogen bond potential scoring 
function an atomic solvation scoring function, a secondary structure propensity scoring function and an electrostatic 
sconng function. As is further described below, at least one scoring function is used to score each position althouqh 
the sconng functions nray differ depending on the position classification or other considerations, like favorable inter- 
action with an a-helix dipole. As outlined below, the total energy which is used In the calculatk>ns Is the sum of the 
energy of each scoring function used at a particular position, as is generally shown in Equation 1 ■ 
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Equation 1 

Etotei = nE^ + nE„ + nEh.bo„d,ng + n^ss + ^^^^ 

[00561 In Equation 1 . the total energy is the sum of the energy of the van der Waals potential (E^^^), the energy of 
atonf,ic solvation (E^^, the energy of hydrogen bonding (EH.bonding). the energy of secondary structure (E^) and the 
energy of elec rostatic interaction (E,^^. The term n is either 0 or 1 . depending on whether the term is to be wnsidered 
for the particular residue position, as is more fully outlined below. 

[0057] In a pref en-ed embodiment, a van der Waals' scoring function is used. As is known in the art, van der Waals' 
forces are the weak, non-covalent and non-ionic forces between atoms and molecules, that is, the induced dipole and 
electron repulsion (Pauli principle) forces. 

[0058] The van der Waals scoring function is based on a van der Waals potential energy. There are a number of van 
der Waals potential energy calculations, including a Lennard-Jones 12/6 potential with radii and well depth parameters 
from the Drelding force field, IMayo et ah, J. Prot. Chem. , 1990, expressly Incorporated herein by reference, or the 
exponential 6 potential. Equation 2, shown below, is the preferred Lennard-Jones potential: 

Equation 2 
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[0059] Ro is the geometric mean of the van der Waals radii of the two atoms under consideration and Dq is the 
geometric mean of the well depth of the two atoms under consideration. E^ and R are the energy and interatomic 
distance between the two atoms under consideration, as is more fully described below. 

[0060] In a preferred embodiment, the van der Waals forces are scaled using a scaling factor, a, as is generally 
discussed in Example 4. Equation 3 shows the use of o In the van der Waals Lennard-Jones potential equation: 

Equation 3 
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[0061] The role of the o scaling factor Is to change the Importance of packing effects in the optimization and design 
Of any partteular protein. As discussed in the Examples, different values for a result in different sequences being gen- 
erated by the present methods. Specifically, a reduced van der Waals sterk: constraint can compensate forthe restrictive 
effect of a fixed backbone and discrete side-chain rotamers in the simulation and can allow a broader sampling of 
sequences compatible with a desired fold. In a preferred embodiment, a values ranging from about 0 70 to about 1 1 0 
can be used, with a values from about 0.8 to about 1.05 being prefen-ed, and from about 0.85 to about 1.0 being 
especially prefen-ed. Specific a values which are pretended are 0.80, 0.85, 0.90, 0.95, 1 .00. and 1 05 
[0062] Generally speaking, variation of the van der Waals scale factor a results In four regimes of packing specificity- 
regime 1 where 0.9 s a s 1.05 and packing constraints dominate the sequence selection; regime 2 where 0 8 s a < 
0.9 and the hydrophobic solvation potential begins to compete with packing forces; regime 3 where a < 0 8 and hy- 
drophobic solvation dominates the design; and, regime 4 where o > 1 .05 and van der Waals repulsions appear to be 
too severe to allow meaningful sequence selection. In particular, different a values may be used for core surface and 
boundary posittons, with regimes 1 and 2 being preferred for core residues, regime 1 being preferred for surface res- 
idues, and regime 1 and 2 being prefen-ed for boundary residues. 

[0063] In a prefen-ed embodiment, the van der Waals scaling factor is used in the total energy calculations for each 
vanable residue position, including core, surface and boundary positions. 

[0064] In a preferred embodiment, an atomic solvation potential scoring function is used. As is appreciated by those 
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in the art, solvent interactions of a protein are a significant factor in protein stability, and residue/protein hydrophobicity 
has been shown to be the major driving force in protein folding. Thus, there Is an entropic cost to solvating hydrophobic 
surfaces, in addition to the potential for misfolding or aggregation. Accordingly, the burial of hydrophobic surfaces within 
a protein structure is beneficial to both folding and stability. Similarly, th ere can be a disadvantage for burying hydrophll ic 
residues. The accessible surface area of a protein atom is generally defined as the area of the surface over which a 
water molecule can be placed while making van der Waals contact with this atom and not penetrating any other protein 
atom. Thus, In a preferred embodiment, the solvation potential Is generally scored by taking the total possible exposed 
surface area of the moiety or two Independent moieties (either a rotamer or the first rotamer and the second rotamer), 
which is the reference, and subtracting out the "buried" area. I.e. the area which is not solvent exposed due to inter- 
actions either with the backbone or with other rotamers. This thus gives tfie exposed surface area. 
[0065] Altematlvely, a prefenred embodiment calculates the scoring function on the basis of the "buried" portion; i.e. 
the total possible exposed surface area is calculated, and then the calculated surface area after the interaction of the 
moieties is subtracted, leaving the buried surface area. A particularly preferred method does both of these calculations. 
[0066] As Is more fully described below, both of these methods can be done In a variety of ways. See Eisenberg et 
af., Nature 319:1 99-203 (1 986); Connolly. Science 221 :709-71 3 (1 983); and Wodak, et al, Proc. Natl. Acad. Sci. USA 
77(4): 1736-1 740 (1980), all of which are expressly Incorporated herein by reference. As will be appreciated by those 
In the art, this solvation potential scoring function is confomiation dependent, rather than conformation independent. 
[0067] In a preferred embodiment, the pairwise solvation potential is Implemented In two components, "singles" 
(rotamer/template) and "doubles" (rotamer/rotamer), as Is more fully described below. For the rotamer/template buried 
area, the reference state Is defined as the rotamer in question at residue position I with the backbone atoms only of 
residues 1-1 , i and i+1 , although in some instances just i may be used. Thus, in a preferred embodiment, the soh/ation 
potential is not cateulated for the interaction of each backbone atom with a particular rotamer, although more may be 
done as required. The area of the side chain is calculated with the backbone atoms excluding solvent but not counted 
in the area. The folded state is defined as the area of the rotamer in question at residue i, but now in the context of the 
entire template structure including non-optimized side chains, i.e. every other foxed position residue. The rotamer/ 
template burled area Is the difference between the reference and the folded states. The rotamer/rotamer reference 
area can be done In two ways; one by using simply the sum of the areas of the Isolated rotamers; the second includes 
the full backbone. The folded state is the area of the two rotamers placed in their relative positions on the protein 
scaffold but with no template atoms present, in a preferred embodiment, the Rfchards definition of soh^ent accessible 
surface area (Lee and Richards. J. Mol. Biol. 55:379-400, 1971, hereby incorporated by reference) Is used, with a 
probe radius ranging from 0.8 to 1 .6 A, with 1 .4 A being prefen-ed, and Drieding van der Waals radii, scaled from 0.8 
to 1 .0. Carbon and sulfur, and all attached hydrogens, are considered nonpolar. Nitrogen and oxygen, and ail attached 
hydrogens, are considered polar. Surface areas are calculated with the Connolly algorithm using a dot density of 10 
A-2 (Connolly, (1983) (supra), hereby Incorporated by reference). 

[0068] In a preferred embodiment, there is a correction for a possible overestimation of buried surface area which 
may exist in the calculation of the energy of interaction between two rotamers (but not the interaction of a rotamer with 
the backbone). Since, as is generally outlined below, rotamers are only considered in pairs, that Is. a first rotamer is 
only compared to a second rotamer during the "doubles" calculations, this may overestimate the amount of burled 
suri'ace area in locations where more than two rotamers interact, that is, where rotamers from three or more residue 
positions come together. Thus, a correction or scaling factor Is used as outlined below. 
[0069] The general energy of solvation is shown in Equation 4: 

Equation 4 



where E^a is the energy of solvation, f Is a constant used to correlate surface area and energy, and SA is the surface 
area. This equation can be broken down, depending on which parameter is being evaluated. Thus, when the hydro- 
phobic burled surface area is used, Equation 5 is appropriate: 

Equation 5 

^sa = U (^^burled hydrophobic) 

Where f^ Is a constant which ranges from about 10 to about 50 cal/mol/A^ with 23 or 26 cal/moI/A^ being preferred. 
When a penally for hydrophilic burial is being considered, the equation is shown in Equation 6: 
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Equations 

=sa = N (SAburtedhydiophoble) + f2{SAburied hydrophnic) 

Where is a constant which ranges from -50 to -250 cai/mol/A2 with -86 or -1 00 cal/mol/A2 being preferred. SImilarlv 
If a penai^ for hydrophobic exposure is used, equation 7 or 8 may be used: 

Equation 7 

Eg, = f , (SAburted hydrophobic) + hi^^xposed hydrophobic) 

Equation 8 

Esa = ^1 (SAb„ried hyd">phoblc) + ^si^fi^buriea hydrophlllo) + h(^^expo^ hydraphoblo) +U{^\xpo^ hyd«>phlllc) 

I n a preferred embodiment, fg = -f , . 

^°?7°',/A?« ?^ embodiment, backbone atoms are not included in the calculation of surface areas, and values of 23 
cal/mol/A2 (f,) and -86 cal/mol/A2 (fa) are deteimlned. 

[0071] In a prefen-ed embodiment, this overcounting problem is addressed using a scaling factor that compensates 

HIT!' P^'™"^® ^'^^ ^"''j^' *° overcounting. In this embodiment, values of 

. -26 cal/mol/A2 (fO and 1 00 cal/mol/A2 (fa) are detemiined. 

[0072] Atomic solvation energy is expensive, in temis of computational time and resources. Accordingly, In a preferred 
embodiment, the solvation energy is calculated for core and/or boundary residues, but not surface residues with both 
fi^^ boundary residues being preferred, although any combination of the three is possible 

[0073] In a prefen-ed embodiment, a hydrogen bond potential scoring function is used. A hydrogen bond potential is 
'ff^i^^^ f hydrogen bonds do contribute to designed protein stability (see Stickle etal.. J. Mol. Biol. 226' 11 43 
(1892); Huyghues-Despointes etal.. Blochem. 34:13267 (1995). both of which are expressly Incorporated herein by 
reference). As outlined previously, explicit hydrogens are generated on the protein backbone structure 
[0074J In a prefered embodiment, the hydrogen bond potential consists of a distance-dependent term and an angle- 
dependent temi, as shown In Equation 9: 



Equation 9 




wiiere Ro (2.8 A) and Dq (8 kcal/moO are the hydrogen-bond equilibrium distance and well-depth, respectively, and R 
IS me donor to acceptor distance. This hydrogen bond potential is based on the potential used in DREIDING with more 
restrictive angle-dependent terms to limit the occurrence of unfavorable hydrogen bond geometries. The angle tBm 
varies depending on the hybridization state of the donor and acceptor, as shown in Equations 10, 11 12 and 13 

nS«rV,i ? ""^^ ^P' a^'^ePtor. Equation 12 is 

used for spz donor to sffi acceptor, and Equation 1 3 is used for sp2 donor to sp2 acceptor 



Equation 10 



F = cos^ 6 cos^ (0 -109.5) 
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Equation 1 1 
F = cos^ e cos^ 0 



Equation 12 
F = cos^9 



Equation 13 
F = COS e COS (max [<f , <p]) 

[0075] In Equations 1 0-13, 9 is the donor-hydrogen-acceptor angle, ^ is the hydrogen-acceptor-base angle (the base 
is the atom attached to the acceptor, for example the carbonyi carbon Is the base for a carfaonyl oxygen acceptor), and 
<p is the angle between the normals of the planes defined by the six atoms attached to the sp2 centers (the supplement 
of <p is used when <p s less than 90**). The hydrogen-bond function is only evaluated when 2.6 A < R < 3.2 A, 9 > 90<», 
^ - lOg.S** < 90» for the sp3 donor - sp^ acceptor case, and, <j) > 90° for the sp3 donor - sp2 acceptor case; preferably, 
no switching functions are used. Template donors and acceptors that are involved in template-tennplate hydrogen bonds 
are preferably not Included In the donor and acceptor lists. For the purpose of exclusion, a template-template hydrogen 
bond is considered to exist when 2.5 A ^ R 5 3.3 A and e > 135*. 

[0076] The hydrogen-bond potential may also be combined or used with a weal( coulombic temi that includes a 
distance-dependent dielectric constant of 40R, where R is the interatomic distance. Partial atomic charges are prefer- 
ably only applied to polar functional groups. A net formal charge of +1 is used for Arg and Lys and a net formal charge 
of -1 is used for Asp and Glu; see Gasteiger, et al , Tetrahedron 36:321 9-3288 (1 980); Rappe, et al , J. Phys. Chem. 
95:3358-3363 (1 991 ). ■ — 

[00771 In a preferred embodiment, an explicit penalty Is given for buried polar hydrogen atoms which are not hydrogen 
bonded to another atom. See Elsenberg, ef a/., (1986) (supra), hereby expressly incorporated by reference. In a pre- 
ferred embodiment, this penally for polar hydrogen burial. Is from about 0 to about 3 kcal/mol, with from about 1 to 
about 3 being preferred and 2 kcai/mol being particularly preferred. This penalty is only applied to buried polar hydro- 
gens not involved in hydrogen bonds. A hydrogen bond is considered to exist when Ehb ranges from about 1 to about 
4 teal/mol. with Ehb of less than -2 l<cal/mol being prefen'ed. In addition, in a preferred embodiment, the penalty Is not 
applied to template hydrogens, I.e. unpaired buried hydrogens of the backbone. 

[0078] In a prefen-ed embodiment, only hydrogen bonds between a first rotamer and the backbone are scored, and 
rotamer-rotamerhydrogen bonds are not scored. In an altemative embodiment, hydrogen bonds between a first rotamer 
and the backbone are scored, and rotamer-rotamer hydrogen bonds are scaled by 0.5 . 

[0079] In a preferred embodiment, the hydrogen bonding scoring function is used for all positions, including core, 
surface and boundary positions. In alternate embodiments, the hydrogen bonding scoring function may be used on 
only one or two of these. 

[0080] In a preferred embodiment, a secondary structure propensity scoring function is used. This is based on the 
specific amino acid side chain, and is confonnation independent. That is, each amino acid has a certain propensity to 
take on a secondary structure, either a-heiix or p-sheet, based on its ^ and y angles. See U\xf\oz et al, Current Op. 
in Biotech. 6:382 (1995); Minor, ef a/.. Nature 367:660-663 (1994); Padmanabhan, et ai. Nature 344:268-270 (1990); 
IVIunoz, etaL, Folding & Design 1(3):167-178 (1996); and Chakrabartty, eta!.. Protein Sci. 3:843(1994), all of whfch 
are expressly incorporated herein by reference. Thus, for variable residue positions that are in recognizable secondary 
structure In the backbone, a secondary structure propensity scoring function is preferably used. That is, when a variable 
residue position is in an a-helical area of the backbone, the a-hellcal propensity scoring function described below is 
calculated. Whether or not a position is in a a-he!ical area of the backbone Is determined as will be appreciated by 
those In the art, generally on the basis of 0 and y angles; for a-helix, ^ angles from -2 to -70 and v angles from -30 to 
-100 generally describe an a-hellcal area of the backbone, 

[0081] Similarly, when a variable residue position is in a p-sheet backbone confomiation, the p-sheet propensity 
scoring function Is used, p-sheet backbone conformation is generally described by ^ angles from -30 to -100 and X 
angles from +40 to +1 80. In alternate preferred embodiments, variable residue positions which are within areas of the 
backbone which are not assignable to either p-sheet or a-hellx stmcture may also be subjected to secondary structure 
propensity calcutations. 

[0082] In a prefen-ed embodiment, energies associated with secondary propensities are cateulated using Equation 
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14: 



Equation 14 
E. = 10'-<"^-"^^«-^1 

[0083] In Equation 14, (or Ep) is the energy of a-helica! propensity, AG^ Is the standard free energy of helix 
propagation of the amino acid, and AG%,a is the standard free energy of helix propagation of alanine used as a standard, 
or standard free energy of p-sheet formation of the amino acid, both of which are available in the literature (see Chakra- 
bartty, et aL, (1994) (supra), and iVIunoz, etaL, Folding & Design 1(3):167-178 (1996)), both of which are expressly 
Incorporated herein by reference), and Ngs is the propensity scale factor which Is set to range from 1 to 4, with 3.0 
being preferred. This potential is preferably selected in order to scale the propensity energies to a similar range as the 
other terms in the scoring function. 

[0084] In a pretended embodiment, p-sheet propensities are preferably calculated only where the 1-1 and 1+1 residues 
are also in p-sheet conformation. 

[0085] In a prefen-ed embodiment, the secondary structure propensity scoring function is used only in the energy 
calculations for surface variable residue positions. In alternate embodiments, the secondary structure propensity scor- 
ing function is used In the calculations for core and boundary regions as well. 

[0086] In a prefen-ed embodiment, an electrostatic scoring function is used, as shown below in Equation 15: 



Equation 15 




[0087] In this Equation, q Is the charge on atom 1 , q' Is charge on atom 2, and r Is the interaction distance. 
[0088] In a preferred embodiment, at least one scoring function is used for each variable residue position; in preferred 
embodiments, two, three or four scoring functions are used for each variable residue position. 
[0089] Once the scoring functions to be used are Identified for each variable position, the prefen-ed first step in the 
computational analysis comprises the detemiination of the interaction of each possible rotamer with all or part of the 
remainder of the protein. That is, the energy of interaction, as measured by one or more of the scoring functions, of 
each possible rotamer at each variable residue position with either the backbone or other rotamers, is calculated! In 
a preferred embodiment, the Interaction of each rotamer with the entire remainder of the protein, i.e. both the entire 
template and all other rotamers, is done. However, as outlined above, it is possible to only model a portion of a protein, 
for example a domain of a larger protein, and thus in some cases, not all of the protein need be considered. 
[0090] In a preferred embodiment, the first step of the computational processing is done by calculating two sets of 
interactions for each.rotamer at every position (step 70 of figure 3): the interaction of the rotamer side chain with the 
template or backbone (the 'singles" energy), and the interaction of the rotamer side chain with ail other possible ro- 
tamers at every other position (the "doubles" energy), whether that position is varied or floated. It should be understood 
that the backbone in this case includes both the atoms of the protein structure backbone, as well as the atoms of any 
fixed residues, wherein the fixed residues are defined as a particular conformation of an amino acid. 
[0091] Thus, "singles" (rotamerAemplate) energies are calculated for the interaction of every possible rotamer at 
every variable residue position with the backbone, using some or ail of the scoring functions. Thus, for the hydrogen 
bonding scoring function, every hydrogen bonding atom of the rotamer and every hydrogen bonding atom of the back- 
bone Is evaluated, and the Ehb is calculated for each possible rotamer at every variable position . Similarly, for the van 
der Waals scoring function, every atom of the rotamer is compared to every atom of the template (generally excluding 
the backbone atoms of its own residue), and the E^^w 's calculated for each possible rotamer at every variable residue 
position. In addition, generally no van der Waals energy Is calculated if the atoms are connected by three bonds or 
less. For the atomfc solvation scoring function, the surface of the rotamer is measured against the surface of the 
template, and the E^iox each possible rotamer at every variable residue position is calculated. The secondary structure 
propensity scoring function is also considered as a singles energy, and thus the total singles energy may contain an 
Ess temi. As will be appreciated by those in the art, many of these energy tenns will be close to zero, depending on 
the physical distance between the rotamer and the template position; that is, the farther apart the two moieties the 
lower the energy. 

[0092] Accordingly, as outlined above, the total singles energy is the sum of the energy of each scoring function used 
at a particular position, as shown in Equation 1 , wherein n is either 1 or zero, depending on whether that particular 
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scoring function was used at the rotamer position: 

Equation 1 

[0093] Once calculated, each singles E^^^^ for each possible rotamer is stored in the memory 24 within the computer, 
such that it may be used in subsequent caiculations, as outlined below. 

[0094] For the calculation of "doubles" energy (rotamer/rotamer), the Interaction energy of each possible rotamer Is 
compared with every possible rotamer at all other variable residue positions. Thus, "doubles" energies are calculated 
for the Interaction of every possible rotamer at every variable residue position with every possible rotamer at every 
other variable residue position, using some or all of the scoring functions. Thus, for the hydrogen bonding scoring 
function, every hydrogen bonding atom of the first rotamer and every hydrogen bonding atom of every possible second 
rotamer is evaluated, and the E^g Is calculated for each possible rotamer pair for any two variable positions. Similarly, 
for the van der Waals scoring function, every atom of the first rotamer is compared to every atom of every possible 
second rotamer, and the E^^ is calculated for each possible rotamer pair at every two variable residue positions. For 
the atomic solvation scoring function, the surface of the first rotamer is measured against the surface of every possible 
second rotamer, and the E^s for each possible rotamer pair at every two variable residue positions is calculated. The 
secondary structure propensity scoring function need not be run as a "doubles" energy, as it is considered as a com- 
ponent of the "singles' energy. As will be appreciated by those In the art, many of these double energy temis will be 
close to zero, depending on the physical distance between the first rotamer and the second rotamer; that Is, the farther 
apart the two moieties, the lower the energy. 

[0095] Accordingly, as outlined above, the total doubles energy is the sum of the energy of each scoring function 
used to evaluate every possible pair of rotamers, as shown In Equation 16, wherein n Is either 1 or zero, depending 
on whether that particular scoring function was used at the rotamer position: 

Equation 16 

30 

^total = "^vdiv + '^^as + "^h-bondlng + ^elee 

[0096] An example is illuminating. A first variable position, I, has three (an unreallsticalty low number) possible ro- 
tamers (which may be either from a single amino acid or different amino acids) which are labelled la, ib, and ic. A 
35 second variable position, j, also has three possible rotamers, labelled jd, je, and jf . Thus, nine doubles energies (Etotai) 
are calculated in ali: Etotai(ia, jd). Etotai(ia. je), Etetai(ia. jf). Etotaj(ib. jd), EtoiaiOb, je). E^^{\b, jf), Etotai(ic, jd). Etotaiiic. je), 
and Etotai(ic, jf). 

[0097] Once calculated, each doubles Et^ja, for each possible rotamer pair is stored in memory 24 within the computer, 
such that it may be used In subsequent calculations, as outlined below. 

40 [0098] Once the singles and doubles energies are calculated and stored, the next step of the computational process- 
ing may occur. Generally q3eaklng. the goal of the computational processing Is to determine a set of optimized protein 
sequences. By "optimized protein sequence" herein is meant a sequence that best fits the mathematical equations 
herein. As will be appreciated by those in the art, a global optimized sequence is the one sequence that best fits 
Equation 1 , i.e. the sequence that has the lowest energy of any possible sequence. However, there are any number 

45 of sequences that are not the global minimum but that have low energies. 

[0099] in a preferred embodiment, the set comprises the globally optimal sequence in its optimal confonnation, i.e. 
the optimum rotamer at each variable position. That Is, computational processing is run until the simulation program 
converges on a single sequence which Is the global optimum. 

[0100] In a preferred embodiment, the set comprises at least two optimized protein sequences. Thus for example, 
50 the computational processing step may eliminate a number of disfavored combinations but be stopped prior to con- 
vergence, providing a set of sequences of which the global optimum Is one. In addition, further computational analysis, 
for example using a different method, may be run on the set, to further eliminate sequences or rank them differently. 
Alternatively, as Is more fully described below, the global optimum may be reached, and then further computational 
processing may occur, which generates additional optimized sequences In the neighborhood of the global optimum. 
55 [0101] If a set comprising more than one optimized protein sequences is generated, they may be rank ordered in 
ternns of theoretical quantitative stability, as is more fully described below. 

[0102] In a pretended embodiment, the computational processing step first comprises an elimination step, sometimes 
refen-ed to as "applying a cutoff", either a singles elimination or a doubles elimination. Singles elimination comprises 
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the elimination of all rotamers with tennplate interaction energies of greater than about 10 kcal/mol prior to any com- 
putation: with elimination energies of greater than about 15 kcal/mol being preferred and greater than about 25 kcaU 
mol being especiaiiy prefen-ed. Similarly, doubles elimination is done when a rotamer has interaction energies greater 
than about 10 kcal/mol with all rotamers at a second residue position, with energies greater than about 15 being pre- 
ferred and greater than about 26 kcal/mol being especially prefen-ed. 

[0103] In a preferred embodiment, the computational processing comprises direct determination of total sequence 
energies, followed by comparison of the total sequence energies to ascertain the global optimum and rank order the 
other possible sequences, If desired. The energy of a total sequence is shown below in Equation 17: 



Equation 17 

^ (oial protein ~ ^(b -b) S ^(ia) S ^(iai |a) 

>U I alt i *all j ptin ^ 



[0104] Thus every possible combination of rotamers may be directly evaluated by adding the backbone-backbone 
(sometimes refen-ed to herein as template-template) energy (E^t.^) which Is constant over all sequences herein since 
the backbone Is kept constant), the singles energy for each rotamer (which has already been calculated and stored), 
and the doubles energy for each rotamer pair (which has already been calculated and stored). Each total sequence 
energy of each possible rotamer sequence can then be ranked, either from best to worst or worst to best This is 
obviously computationally expensive and becomes unwieldy as the length of the protein increases. 
[0105] In a preferred embodiment, the computational processing includes one or more Dead-End Elimination (DEE) 
computational steps. The DEE theorem is the basis for a very fast discrete search program that was designed to pack 
protein side chains on a fixed backbone with a known sequence. See Desmet, et aL, Nature 356:539-542 (1992); 
Desmet, ef a/.. The Proetin Folding Problem and Tertiary Structure Prediction. Ch. 10:1-49 (1994); Goldstein, Biophys! 
Jour 66:1335-1340 (1 994), all of which are incorporated herein by reference. DEE is based on the observation that if 
a rotamer can be eliminated from consideration at a particular position, i.e. make a determination that a particular 
rotamer Is definitely not part of the global optimal confonnation. the size of the search Is reduced. This Is done by 
comparing the worst Interaction (i.e. energy or E^^^ of a first rotamer at a single variable position with the best inter- 
action of a second rotamer at the same variable position. If the worst interaction of the first rotamer is still better than 
the best interaction of the second rotamer. then the second rotamer cannot possibly be In the optimal conformation of 
the sequence. The original DEE theorem Is shown in Equation 18: 



Equation 16 

E{ia) + j;[mln over t{E(la. jt)}] > E(ib) + £[max over t{E(lb, Jt)}] 
i J 

[0106] In Equation 1 8, rotamer la is being compared to rotamer lb. The left side of the Inequality is the best possible 
interaction energy (Et^,^^,) of la with the rest of the protein; that is, "min over f means find the rotamer t on position j 
that has the best interaction with rotamer la. Similarly, the right side of the inequality is the worst possible (max) inter- 
action energy of rotamer lb with the rest of the protein. If this inequality is true, then rotamer la is Dead-Ending and 
can be Eliminated. The speed of DEE comes from the fact that the theorem only requires sums over the sequence 
length to test and eliminate rotamers. 

[0107] In a preferred embodiment, a variation of DEE Is performed. Goldstein DEE, based on Goldstein, (1994) 
(supra), hereby expressly incorporated by reference, is a variation of the DEE computation, as shown in Equation 19: 



Equation 19 

E(ia) - E(ib) + 2:[min over t{E(la, jt) - E(ib. jt)}] > 0 

[0108] In essence, the Goldstein Equation 1 9 says that a first rotamer a of a particular posiUon i (rotamer la) will not 
contribute to a local energy minimum If the energy of confonnation with la can always be lowered by just changing the 
rotamer at that position to lb, keeping the other residues equal. If this inequality is true, then rotamer ia Is Dead-Ending 
and can be Eliminated. 
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[0109] Thus, in a preferred embodiment, a first DEE computation Is done where rotamers at a single variable position 
are compared, ("singles" DEE) to eliminate rotamers at a single position. This analysis is repeated for every variable 
position, to eliminate as many single rotamers as possible. In addition, every time a rotamer is eliminated from con- 
sideration through DEE, the minimum and maximum calculations of Equation 18 or 19 change, depending on which 
DEE variation is used, thus conceivably allowing the elimination of further rotamers. Accordingly, the singles DEE 
computation can be repeated until no more rotamers can be eliminated; that is, when the Inequality is not longer true 
such that all of them could conceivably be found on the global optimum. 

[0110] In a preferred embodiment, "doubles" DEE is additionally done. In doubles DEE, pairs of rotamers are eval- 
uated; that is, a first rotamer at a first position and a second rotamer at a second position are compared to a third 
rotamer at the first position and a fourth rotamer at the second position, either using original or Goldstein DEE. Pairs 
are then flagged as nonallowable, although single rotamers cannot be eliminated, only the pair. Again, as for singles 
DEE, every time a rotamer pair is flagged as nonallowable, the minimum calculations of Equation 18 or 19 change 
(depending on which DEE variation is used) thus conceivably allowing the flagging of further rotamer pairs. Accordingly, 
the doubles DEE computation can be repeated until no more rotamer pairs can be flagged; that Is, where the energy 
of rotamer pairs overlap such that all of them could conceivably be found on the global optimum. 
[0111] in addition, In a preferred embodiment, rotamer pairs are initially prescreened to eliminate rotamer pairs prior 
to DEE. This is done by doing relatively computationaily inexpensive calculations to eliminate certain pairs up front. 
This may be done in several ways, as is outlined below. 

[0112] In a prefen-ed embodiment, the rotamer pair with the lowest interaction energy with the rest of the system is 
found. Inspection of the energy distributions In sample matrices has revealed that an Ij^ pair that dead-end eliminates 
a particular ij^ pair can also eliminate other ij^ pairs. In fact, there are often a few i Jy pairs, which we call "magic 
bullets," that eliminate a significant number of i^ pairs. We have found that one of the most potent magic bullets is the 
pair for which maximum Interaction energy, tn,a^([i jv])kt, is least This pair is referred to as [iJvJmb' ^^'^ rotamer pair 
is used in the first round of doubles DEE, it tends to eliminate pairs faster. 

[0113] Our first speed enhancement is to evaluate the first-order doubles calcuiatlon for only the matrix elements in 
the row corresponding to the [ijvlmb P^^^- discovery of [yvlmb calculation (n = the number of rotamers per 

position), and the application of Equation 1 9 to the single row of the matrix corresponding to this rotamer pair Is another 
n^ calculation, so the calcuiatlon time is small in comparison to a full first-order doubles calculation. In practice, this 
calculation produces a large number of dead-ending pairs, often enough to proceed to the next Iteration of singles 
elimination without any further searching of the doubles matrix. 

[0114] The magic bullet first-order calculation will also discover all dead-ending pairs that would be discovered by 
the Equation 18 or 1 9, thereby making it unnecessary. This stems from the fact that ^max([yvlmb) ^^^^ '®ss than 
or equal to any E^BxCt'Jv]) ^^^^ would successfully eliminate a pair by he Equation 18 or 19. 
[0115] Since the minima and maxima of any given pair has been precalculated as outlined herein, a second speed- 
enhancement precalculation may be done. By comparing extrema, pairs that will not dead end can be Identified and 
thus skipped, reducing the time of the DEE calculation. Thus, pairs that satisfy either one of the following criteria are 
skipped: 

Equation 20 

^mlntt'Vs])<^niin{U'u/vl) 



Equation 21: 

emaxtf'VslH^maxtf'VVJ) 

[0116] Because the matrix containing these calculations Is symmetrical, half of its elements will satisfy the first ine- 
quality Equation 20, and half of those remaining will satisfy the other Inequality Equation 21 . These three quarters of 
the matrix need not be subjected to the evaluation of Equation 18 or 19, resulting in a theoretical speed enhancement 
of a factor of four. 

[0117] The last DEE speed enhancement refines the search of the remaining quarter of the matrix. This is done by 
constructing a metric from the precomputed extrema to detect those matrix elements likely to result in a dead-ending 

pair. 

[0118] A metric was found through analysis of matrices from different sample optimizations. We searched for com- 
binations of the extrema that predicted the likelihood that a matrix element would produce a dead-ending pair, interval 
sizes (see Figure 1 2) for each pair were computed from differences of the extrema. The size of the overlap of the ijg 
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and fj^ intervals were also computed, as well as the difference between the minima and the difference between the 
maxima. Combinations of these quantities, as well as the lone extrema, were tested for their ability to predict the 
occun-ence of dead-ending pairs. Because some of the maxima were very large, the quantities were also compared 
logarithmically. 

[01191 Most of the combinations were able to predict dead-ending matrix elements to varying degrees. The best 
metrics were the fractional interval overlap with respect to each pair, referred to herein as q„ and q 

Equation 22 

^ interval overlap ^maxiV M) '^minHUsi) 

" '"f^'^^/ ( I Vsl )^ e^a.(['Vs])-em/n(['Vs]) 

Equation 23 

_ mterval overlap ^max ( U M ) -^m/n ( Js] ) 
interval {[i J,]) ^^(U Jy])^S„,,([i X]) 

[0120] These values are calculated using the minima and maxima equations 24, 25, 26 and 27 (see Figure 14): 



Equation 24 



Equation 25 • 



Equation 26 



Equation 27 

k*i*j t 

[0121] These metrics were selected because they yield ratios of the occurrence of dead-ending matrix elements to 
the total occurrence of elements that are higher than any of the other metrics tested. For example, there are very few 
matrix elements (-2%) for which > 0.98, yet these elements produce 30-40% of all of the dead-ending pairs. 
[0122] Accordingly, the first-order doubles criterion Is applied only to those doubles for which q > 0 98 and q > 
0.99. The sample data analyses predict that by using these two metrics, as many as half of the dead-ending elements 
may be found by evaluating only two to five percent of the reduced matrix. 
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[0123] Generally, as Is more fully described below, single and double DEE, using either or both of original DEE and 
Goldstein DEE, is run until no further elimination is possible. Usually, convergence is not complete, and further elimi- 
nation must occur to achieve convergence. This Is generally done using "super residue" DEE. 
[0124] in a prefen-ed embodiment, additional DEE computation is done by the creation of "super residues" or "uni- 
fication", as is generally described in Desmet, Nature 356:539-542 (1992); Desmet, ef a/., The Protein Folding Problem 
and Tertiary Structure Prediction . Ch. 10:1-49 (1 994); Goldstein, et aL, supra. A super residue Is a combination of two 
or more variable residue positions which is then treated as a single residue position. The super residue is then evaluated 
in singles DEE., and doubles DEE, with either other residue positions or super residues. The disadvantage of super 
residues is that there are many more rotameric states which must be evaluated; that Is. If a first variable residue position 
has 5 possible rotamers, and a second variable residue position has 4 possible rotamers, there are 20 possible super 
residue rotamers which must be evaluated. However, these super residues may be eliminated similar to singles, rather 
than being flagged like pairs. 

[0125] The selection of which positions to combine Into super residues may be done In a variety of ways. In general, 
random selection of positions for super residues results in Inefficient elimination, but It can be done, although this is 
not preferred. In a preferred embodiment, the first evaluation is the selection of positions for a super residue is the 
number of rotamers at the position. If the position has too many rotamers. it is never unified into a super residue, as 
the computetion becomes too unwieldy, Thus, only positions with fewerthan about 100,000 rotamers are chosen, with 
less than about 50,000 being preferred and less than about 10,000 being especially preferred. 
[0126] In a prefen-ed embodiment, the evaluation of whether to forni a super residue is done as follows. All possible 
rotamer pairs are ranked using Equation 28, and the rotamer pair with the highest number Is chosen for unification: 



Equation 28 

fraction of flagged pairs 



log 



(number of super rotamers resulting from the potenttal ur^cation) 



[0127] Equation 28 is looking for the pair of positions that has the highest fraction or percentage of flagged pairs but 
the fewest number of super rotamers. That Is, the pair that gives the highest value for Equation 28 is preferably chosen. 
Thus, If the pair of positions that has the highest number of flagged pairs but also a very large number of super rotamers 
(that is, the number of rotamers at position i times the number of rotamers at position j), this pair may not be chosen 
(although it could) over a lower percentage of flagged pairs but fewer super rotamers. 

[0128] In an alternate preferred embodiment, positions are chosen for super residues that have the highest average 
energy; that Is, for positions i and j, the average energy of ait rotamers for i and all rotamers for j is calculated, and the 
pair with the highest average energy is chosen as a super residue. 

[0129] Super residues are made one at a time, preferably. After a super residue Is chosen, the singles and doubles 
DEE computations are repeated where the super residue Is treated as if it were a regular residue. As for singles and 
doubles DEE, the elimination of rotamers in the super residue DEE will alter the minimum energy calculations of DEE, 
Thus, repeating singles and/or doubles DEE can result in further elimination of rotamers. 

[0130] Figure 3 is a detailed Illustration of the processing operations associated with a ranking module 34 of the 
Invention. The calculation and storage of the singles and doubles energies 70 is the first step, although these may be 
recalculated every time. Step 72 is the optional application of a cutoff, where singles or doubles energies that are too 
high are eliminated prior to further processing. Either or both of original singles DEE 74 or Goldstein singles DEE 76 
may be done, with the elimination of original singles DEE 74 being generally preferred. Once the singles DEE is run, 
original doubles (78) and/or Goldstein doubles (80) DEE is run. Super residue DEE Is then generally run, either original 
(82) or Goldstein (84) super residue DEE. This preferably results in convergence at a global optimum sequence. As 
is depicted in Rgure 3, after any step any or all of the previous steps can be rerun, in any order, 
[0131 ] The addition of super residue DEE to the computational processing, with repetition of the previous DEE steps, 
generally results in convergence at the global optimum. Convergence to the global optimum is guaranteed if no cutoff 
applications are made, although generally a global optimum is achieved even with these steps, in a preferred embod- 
iment, DEE is run until the global optimum sequence Is found. That is, the set of optimized protein sequences contains 
a single member, the global optimum. 

[0132] In a prefen^ed embodiment, the various DEE steps are run until a managable number of sequences Is found, 
i.e. no further processing is required. These sequences represent a set of optimized protein sequences, and they can 
be evaluated as is more fully described below. Generally, for computational purposes, a manageable number of se- 
quences depends on the length of the sequence, but generally ranges from about 1 to about lO'^s possible rotamer 
sequences. 

[0133] Alternatively, DEE Is run to a point, resulting in a set of optimized sequences (in this context, a set of remainder 



EP 1 255 209 A2 



sequences) and then further compututatlonal processing of a different type may be run. For example, in one embodi- 
ment, direct calculation of sequence energy as outlined above is done on the remainder possible sequences. Alterna- 
tively, a Monte Carlo search can be run. 

[0134] In a preferred embodiment, the computation processing need not comprise a DEE computational step. In this 
embodiment, a Monte Carlo search is undertaken, as is known in tlie art. See Metropolis et a/., J. Chem. Phys. 21: 
1087 (1953), hereby incorporated by reference. In this embodiment, a random sequence comprising random rotamers 
is chosen as a start point. In one embodiment, the vanable residue positions are classified as core, boundary or surface 
residues and the set of available residues at each position is thus defined. Then a random sequence is generated, and 
a random rotamerfor each amino acid is chosen. This senses as the starting sequence of the Monte Carlo search. A 
Monte Carlo search then makes a random jump at one position, either to a different rotamer of the same amino acid 
or a rotamer of a different amino add, and then a new sequence energy (Etotai sequence) ^ calculated, and if the new 
sequence energy meets the Boitzmann criteria for acceptance, it Is used as the starting point for another jump, if the 
Boltzmann test fails, another random jump is attempted from the previous sequence. In this way, sequences with lower 
and lower energies are found, to generate a set of low energy sequences. 

[0135] If computational processing results in a single global optimum sequence, it is frequently preferred to generate 
additional sequences In the energy neighborhood of the global solution, which may be ranked. These additional se- 
quences are also optimized protein sequences. The generation of additional optimizedsequences is generally preferred 
so as to evaluate the differences between the theoretical and actual energies of a sequence. Generally, in a preferred 
embodiment, the set of sequences is at least about 75% homologous to each other, with at least about 80% homologous 
being preferred, at least about 85% homologous being parttoularly preferred, and at least about 90% being especially 
prefen^ed. In some cases, homology as high as 95% to 98% is desirable. Homology in this context means sequence 
similarity or identity, with identity being preferred. Identical in this context means identical amino acids at corresponding 
positions in the two sequences which are being compared. Homology In this context includes amino acids which are 
identical and those which are similar (functionally equivalent). This homology will be delenmlned using standard tech- 
niques known in the art, such as the Best Fit sequence program described by Devereux, etal., Nud. Acid Res. , 12: 
387-396 (1 984), or the BLASTX program (Altschul, et al., J. Mol. Biol. . 21 5:403-41 0 (1 990)) preferably using the default 
settings for either. The alignment may include the introduction of gaps in the sequences to be aligned, in addition, for 
sequences which contain either more or fewer amino acids than an optimum sequence, it is understood that the per- 
centage of homology will be determined based on the number of homologous amino acids in relation to the total number 
of amino acids. Thus, for example, homology of sequences shorter than an optimum will be detemnined using the 
number of amino acids in the shorter sequence. 

[0136] Once optimized protein sequences are identified, the processing of Figure 2 optionally proceeds to step 56 
which entails searching the protein sequences. This processing may be implemented with the search module 36. The 
search module 36 is a set of computer code that executes a search strategy. For example, the search module 36 may 
be written to execute a Monte Carlo search as described above. Starting with the global solution, random positions 
are changed to other rotamers allowed at the particular position, both rotamers from the same amino acid and rotamers 
from different amino acids. A new sequence energy (Etotai sequence) 's calculated, and if the new sequence energy 
meets the Boltzmann criteria for acceptance, it is used as the starting point for another jump. See Metropolis et a/., 
1 953, supra, hereby incorporated by reference. If the Boltzmann test fails, another random jump Is attempted from the 
previous sequence. A list of the sequences and their energies is maintained during the search. After a predetermined 
number of jumps, the best scoring sequences may be output as a rank-ordered list. Preferably, at least about 1 0^ jumps 
are made, with at least about 10^ jumps being preferred and at least about 10^ jumps being particularly preferred. 
Preferably, at least about 100 to 1000 sequences are saved, with at least about 10,000 sequences being prefen^ed 
and at least about 100,000 to 1 ,000,000 sequences being especially prefen^ed. During the search, the temperature is 
preferably set to 1 000 K. 

[0137] Once the Monte Carlo search is over, all of the saved sequences are quenched by changing the temperature 
to 0 K, and fixing the amino acid identity at each position. Preferably, every possible rotamer Jump for that particular 
amino acid at every position is then tried. 

[0138] The computational processing results in a set of optimized protein sequences. These optimized protein se- 
quences are generally, but not always, significantly different from the wild-type sequence from which the backbone 
was taken. That is, each optimized protein sequence preferably comprises at least about 5-1 0% variant amino ackJs 
from the starting or wild-type sequence, with at least about 15-20% changes being preferred and at least about 30% 
changes being particularly preferred. 

[0139] These sequences can be used in a number of ways. In a preferred embodiment, one, some or all of the 
optimized protein sequences are constructed into designed proteins, as show with step 58 of Figure 2. Thereafter, the 
protein sequences can be tested, as shown with step 60 of the Figure 2. Generally, this can be done in one of two ways. 
[0140] In a preferred embodiment, the designed proteins are chemically synthesized as is known in the art. This is 
particularly useful when the designed proteins are short, preferably less than 150 amino acids In length, with less than 
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100 amino acids being preferred, and less than 50 amino acids being particularly preferred, aithough as is known in 
the art, longer proteins can be made chemically or enzymaticaliy. 

[0141] In a prefenred embodiment, particularly for longer proteins or proteins for which large samples are desired, 
the optimized sequence is used to create a nucleic acid such as DNA which encodes the optimized sequence and 
5 which can then be cloned into a host cell and expressed. Thus, nucleic acids, and particularly DNA, can be made which 
encodes each optimized protein sequence. This Is done using well known procedures. The choice of codons, suitable 
expression vectors and suitable host cells will vary depending on a number of factors, and can be easily optimized as 
needed. 

[0142] Once made, the designed proteins are experimentally evaluated and tested for structure, function and stability, 
10 as required. This will be done as is known in the art, and will depend in part on the original protein from whk:h the 
protein backbone structure was taken. Preferably, the designed proteins are more stable than the known protein that 
was used as the starting point, although in some cases, if some constaints are placed on the methods, the designed 
protein may be less stable. Thus, for example, it is possible to fix certain residues for altered biological activity and 
find the most stable sequence, but it may still be less stable than the wild type protein. Stable in this context means 
IS that the new protein retains either biological activity or confonnatlon past the point at which the parent molecule did. 
Stability includes, but is not limited to, thermal stability, i.e. an increase in the temperature at which reversible or irre- 
versible denaturing starts to occur; proteolytic stability, I.e. a decrease In the amount of protein which is irreversibly 
cleaved In the presence of a parttoular protease (Including autolysis); stability to alterations in pH or oxidative conditions; 
chelator stability; stability to metal ions; stability to solvents such as organic solvents, surfactants, fomiulation cheml- 
20 cals; etc. 

[0143] In a prefenred embodiment, the modelled proteins are at least about 5% more stable than the original protein, 
with at least about 1 0% being prefenred and at least about 20-50% being especially preferred. 
[0144] The results of the testing operations may be computationally assessed, as shown with step 62 of Figure 2. 
An assessment module 38 may be used in this operation. That is, computer code may be prepared to analyze the test 
25 data with respect to any number of metrices. 

[0145] At this processing juncture, If the protein Is selected (the yes branch at block 64) then the protein Is utilized 
(step 66), as discussed below. If a protein is not selected, the accumulated infomiation may be used to alter the ranking 
module 34, and/or step 56 is repeated and more sequences are searched. 

[0146] in a prefenred embodiment, the experimental results are used for design feedback and design optimization. 

30 [0147] Once made, the proteins of the Invention find use in a wide variety of applications, as will be appreciated by 
those in the art, ranging from industrial to pharmocologlcal uses, depending on the protein. Thus, for example, proteins 
and enzymes exhibiting increased thermal stability may be used in industrial processes that are frequently run at 
elevated temperatures, for example carbohydrate processing (Including saccharlfication and liquifaction of starch to 
produce high f mctose corn syrup and other sweetners), protein processing (for example the use of proteases in laundry 

35 detergents, food processing, feed stock processing, baking, etc.), etc. Similarly, the methods of the present invention 
allow the generation of useful phamnaceutical proteins, such as analogs of known proteinaceous drugs which are more 
thermostable, less proteolytlcaily sensitive, or contain other desirable changes. 

[0148] The following examples serve to more fully describe the manner of using the above-described Invention, as 
well as to set forth the best modes contemplated for canying out various aspects of the invention. It is understood that 
40 these examples in no way serve to limit the true scope of this invention, but rather are presented for illustrative purposes. 
All references cited herein are explicitly incorporated by reference. 

EXAMPLES 
45 Example 1 

Protein Design Using van der Waals and Atomic Solvation Scoring Functions with DEE 

[0149] A cyclical design strategy was developed that couples theory, computation and experimental testing in order 
50 to address the problems of specificity and learning (Figure 4). Our protein design automation (PDA) cycle is comprised 
of four components: a design paradigm, a simulation module, experimental testing and data analysis. The design 
paradigm is based on the concept of inverse folding (Pabo. Nature 301 :200 (1 983); Bowie, eta/., Science 253:1 64-1 70 
(1991)) and consists of the use of a fixed backbone onto which a sequence of side-chain rotamers can be placed, 
where rotamers are the allowed confonnations of amino acid side chains (Ponder, ef a/., (1987) (supra)). Specific 
55 tertiary interactions based on the three dimensional juxtaposition of atoms are used to detemine the sequences that 
will potentially best adopt the target fold. Given a backbone geometry and the possible rotamers allowed for each 
residue position as input, the simulation must generate as output a rank ordered list of solutions based on a cost 
function that explicitly considers the atom positions in the various rotamers. The principle obstacle is that a fixed back- 



EP 1 255 209 A2 



m "on!!Tl " and m possible rotamers per residue (ali rotameis of ali allowed amino acids) results in 

SO^SllTTr:::T °' f/f r numberforeven small design problems. For example, to consTder 
50 rotamers at 15 positions results in over 1025 sequences. ^^^^^ 3, ^^^^^^^^^ ^^^^ sequences per second 
(far beyond current capabiiWes) would take 10^ years to exhaustively search for the global minimum The synS 

TIT: : °' "'"'"^ '''' ^^""^"^^^ thesimulatlon module ge3ese)^eSa1 

datatortheanalys.smodule.Theanalysissection discovers correlations between calculable prop^^^^^ 

s rucuires and the experimental observables. Tlie goal of the analysis is to suggest quanLL modSo^ to Se 

Z ? T *° ^^^^'9" P^^^-^'a-^' o^^^^or^s. the c«s?function3ir^S^^^^ 

module describes a theoretical potential energy surface whose horizontal axis comprises all possibrsoTtions tf the 
prob em at hand. This potential energy surface is not guaranteed to match the actual potential energy surface wh ch 
^e emiined from ,he experimental data. In this light, the goal of the analysis becomes^ co 'e^ion of the sTmuSon 
cost function in order to create better agreement between the theoretical and actual potential energy surfacrif sS 

ZTT 'J '""S' ^"'P"' «'^"'«««>"« be amino add sequences that b^e'aS 

the target properties. This design cyde is generally applicable to any protein system and, by remoZ the subiectte 

mm Te"Z\T^ " <a,gely unbiased approach to protein de^gn. i.e. p'rotein design' auZSr ^ ' 

S riir„° I '^^"'^^^ ^ '"P"' « ''^°'*'>"« defining the desired fold The 

tesk of des gning a sequence that takes this fold can be viewed as finding an optimal arrangement of amino ac" side 

IT/^ T Tr"" " '° ^"^'^ ^ '^''^ amino acid when evalu^^^^^^^^ 

tZ of^.h" M J° 3^°"^"'"^ ^P^'^'fl^'^y °f ^"^^-^^►'a'" P'a^^'-ent. all possible Zfoma- 

tions of each side chain must also be examined. Statistical surveys of the protein structure database (Ponde? e™/ 
supra) have defined a discrete set of allowed confomiations, called rotame^. for each amino acid ?de chain We use 
a rotamer brary based on the Ponder and Richards library to define allowed confomiations for the side chains rP^A 
J? 'T "'^^%'^««^'^P"°" ^ '^f^^'ns. an optimal sequence for a backbone can be found by scLning 
all possible sequences of rotamers. where each backbone position can be occupied by each amino acid in al ite 
possible rotameric states. The discrete nature of rotamer sets allows a simple calculation onheTumber of rotamer 

IZl^l '"''1;? '''T °' '""^'^ " P°«'«°" will have Jpossibi ^ ame 

3r,n T " "'l'^^'"^ "P^** 9rows exponentially with sequence length which for typical values of n and 

S!,wioo!r)"*"°? algorithm: An extension of the Dead End EHmination (DEE) theorem was developed (Desmet 

lem^ifDl?,rt''?T*f t^"P^^) t° combinatorlalseareh pS 

ihTi"n? n f I ! T ' ^''^ "^"^ ' '^"^ ^'"'^'"'^ «'9'»^thm that was designed to pack protein side 
i Z to JL'i r^?f " known sequence. Side chains are described by rotamers and an atomls'ct rTeSd 

s used to score rotamer arrangements. The DEE theorern guarantees that if the algorithm converges the alobal oo- 
timum packing is found. The DEE method is readily extended to our inverse folding design par^c^SrelS the 
constraint tha a position is limited to the rotamers of a single amino acid. This extension of DEeX% i tT^e? the 
ZSl? '""^T"' each position and requires a significantly modified implementation to ensuTror^gence 

means that ttie globally optimal sequence is found in its optimal conformation 

fsuD?i f.lTJZnT^T'^ f ^""^^^ '° improvements suggested by Goldstein (Goldstein. (1 994) 

inTJ«"' . !. application of the R=1 rotamer elimination and R=0 rotamer-pair f agqina 

Z lZl to find the globa soluS 

L r ""'^'"9 '■^"'^"^^ '"to "super residues" (Desmet. ef al.. (1 992( (supra)' Desmet ef 

a^. 1 994) supra); Goldstein, (1 994) (supra). However, unification can cause an unmanag^le inc S tSe number 

for applying the R=1 rotamer-pair flagging equatton increases as the fourth power of the number of rotamers These 

oZrwh K " """'^ *° '^'"^"'y "'^^ '"'^'^^^ performance, we developed a heuristic that 

inJuTedrn the R-1 rSSne! 0^;^""'^ T ^'^^ """"^^ ^^""^ ^"P- -tamer) pairs that aJ 
eneraies f?om Pni "^^l^tp ^« w 9^'"g ^'"at'on- A program called PD A_DEE was written that takes a Itet of rotamer 

m^^ tTr^n!^-f^ -^H ""'P""' ^"^"^"^^ ^Pt'^ia' confomiation with its energy. 

flA. (1992) supra)). X, and angle values of rotamers for ali amino acids except Met, Arg and Lys were expanded 
plus and minus one standard deviation about the mean value from the Ponder and Richards irbrary LnrS in order to 

f?n^T^e7arbSn;r;''' "'^'^ ''^ '"'^"^^"^^^ ^ an th7wLZdlle^ 

from the database statistics were assigned values of 0» and 1 80" for Gin and 60«, -60* and 1 80» for Met Lvs and Aro 

The number of rotamers per amino acid is: Gly. 1; Ala. 1; Val, 9; Ser, 9; Cys, 9; Vh, 9- Leu 36- 4?>re 36 Tvr 

36: Trp. 54; H«. 54; Asp, 27; Asn. 54; Glu. 69; Gin. 90; Met. 21; Lys. 57; Arg. 55. Tiie'cySc ami^o acid Pro wa;Iot 
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bond l!nl J r ^" '^'^^'^ "■"'■^^ '^""'^'"^^ «''P"''^'t t^ydrogen atoms. Rotamefs were built with 
J '"'"^'/=°""9 to"- sequence arrangements used in the search was an atomic van der Waals 

Sltt^TtL " ""fT'T' """^ ^'"""^ ''''' '"»--«ons Which aXorta 't 

wt 2 .^H w M '"'^^ d""ensional arrangement of protein side chains. A Lennard-Jones 12^ potential 

rr.? ^ ^^''""3 "^^'^ Waate interactions. Non^onded 

interactions for atoms connected by one or two bonds were not considered, van der Waals radii for atoms connected 
by three bonds were scaled by 0.5. Rotamer/rotamer pair energies and rotamer/template energies were caSatedln 

prrn"rarZ?a"n?rrf T T'^''- '''''^ (-PraW-Thetemplatainst^S 

Siater?h^ Lhl^ ^° '*P"'"'^«''- ^° Intra-side-chaln potentials were 

DEE^d rotamers with template interaction energies greater than25 kcal/mol were eliminated. Also, any rotamer whose 

TUTrf v!^^ "'T" '"P"' coordinates. Including side chains for positions no^ op 

pn?S?o"^' ^ P"'**'""" '^P*'^'^^ ^"'^ « ^''^ a^'ds to be considered a" each 
m«« ""tP"^ « "s» of rotamer/template and rotamer/rotamer energies. 

ShoiIJ! P';™"^V°'^f °" P°t^"«a' was implemented in two componente to remain consistent with the DEE 
methodology: rotamerrtemplate and rotamer/rotamer burial. For the rotemerAemplate buried area, the reference state 
was d^ined as the rotamer In question at residue /with the backbone atoms oniy of residues M anJtlTe a'ea 

s^e wrd^mr^'^tH '^'^^^^^^ '"'''""^ ''^"^ ^"^'"'^'"9 ''^""^^^ - area The foZ 

s ate was defined as the area of the rotamer in question at residue /. but now in the context of the enfire template 

nceTd T^:T.T'T'' '''' ~*«^«^/templato buried area is the difference b^^ 'the re^' 

?SrfoTdediSL^?^^^^^ 

t?m,it ♦ ^ °' '"^ P'^"^'' ^"^'^ P«'«°"s 0" the protein scaffold but with no 

template atoms present. The Richards definition of solvent accessible surface area (Lee & Richards 1 971 supra) was 
used, wtth a probe radius of 1 .4 A and Drieding van der Waals radii. Carbon and sulfur, and all aJkched hydrTgeT 
were considered nonpolar. Nitrogen and oxygen, and all attached hydrogens, were cor^sidered polar Surfaol ar^s 
were cateulated with the ConnolV algorithm using a dot density of 10 a' (Connolly. (1983) (supra» In mor^r^^t 

algorithm to speed up the calculation (Perrot, etal.. J. Comput. Chem. 13:1-11 (1992)( 

Sho^c JI?! f!!!*'': optimization, a rank ordered list of sequences was generated by a Monte 

SerJ^h 1 of the DEE solution. This list of sequences was necessary because of poS 

set Sout^rnJl T'T ^Tl '^^ '^^''^^ ^^^"^ S'"^' '"'"'"lum 

amatsite-Anewsequence energy was calculated and. If it met the Botamancriteriaforacce^^^ 

wa« used as the starting point for another jump. If the Boteman test failed, then anotherZdom jump w^Sempted 

in rSr.S^"'"''- ^ '^^"""'^^^ ^''^'^ «"«'9'«« ««« maintained thmughout me 

t^i„ LI ''"inched by changing the temperature to 0 K. fixing the amino acid identity 

PDA StT^h^^' ''Z''"^^' ^"""P '""^ P°^'«°"- "^^^ ««« implemented in a program called 

pni-s^nJp Th t 'T"' '^""Z ^'^'^ "P'™'" P^'^-D^E « of rotamer energies frorn 

PDA SETUP. PDA DEE an^ 

San 5?ego^Ar' ^^^"""^ development environment (Blosym/Molecular Simulations. 

To^Lnu^-^^!"; ^°f- c^^' ^"'^ P°A_MONTE were implemented in the CERIUS2 software development en- 
vironment (Biosym/Molecular Simulations. San Diego. CA). c'"H"iBrii en 

[0159] Model system and experimental testing: The homodimerfc coiled coil of a helices was selected as the 

rll^T TT'^ ''^g^"'"^'^" ease Characterization. TTieir sequences display a seven residue periodic HP pattern 
called a heptad repeat. (a.b.c.d.e.f.g) (Cohen & Parry. Proteins Struc. Func. Genet. 7:1-15 (1990)) "me a anS d 

sitlonsareusually hydrophobic and burled atthedlmer interface While the otherpositions are usuZp^^^^ 
GCN4-P1 (O Shea, etal.. Science 254:539 (1991)). The 16 hydrophobe a and d positions were optimized In the cnrs 
ro H '"^ P"'^^'"- '^"'"'^'"'^^ ^«^"-"°e symmetry was iced 7ni, 

?J!?!r> """^'""'eric coiled colls were modeled on the backbone coordinates of GCN4-p1, ROB ascension code 
2ZrA (Bernstein, etal.. J. Mol. Biol. 1 12:535 (1 977); O'Shea, etat.. supra). Atoms of all side chains not%Sed were 
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left In their crystallographlcally determined positions. The program BIOGRAF (Blosym/Molecular Simulations, San 
Diego, CA) was used to generate explicit hydrogens on the structure which was then conjugate gradient minimized for 
50 steps using the Drelding lorcefleld. The HP pattern was enforced by only allowing hydrophobic amino acids Into 
the rotamer groups for the optimized a and d positions. The hydrophobic group consisted of AJa, Val, Leu, lie, l\^et, 

5 Phe, Tyr and Tip for a total of 238 rotamers per position. Homodimer symmetry was enforced by penalizing by 100 
Iccal/mol rotamer pairs that violate sequence symmetry. Different rotamers of the same amino acid were allowed at 
symmetry related positions. The asparaglnef that occupies the a position at residue 1 6 was left in the template and not 
optimized. A 10^ step Monte Carlo search run at a temperature of 1000 K generated the list of candidate sequences 
ranl< ordered by their score. To test reproducibility, the search was repeated three times with different random number 

10 seeds and all trials provided essentially identical results. The Monte Carlo searches tooi( about 90 minutes. All caicu- 
iatbns in this worl( were perfomned on a Silicon Graphics 200 MHz R4400 processor 

[0161] Optimizing the 16 a and d positions each with 238 possible hydrophobic rotamers results In 23816 or 
rotamer sequences. The DEE algorithm finds the global optimum in three minutes, including rotamer energy calculation 
time. The DEE solution matches the naturally occumng GCN4-p1 sequence of a and d residues for all of the 1 6 posl- 
is tions. A one million step Monte Carlo search run at a temperature of 1000 K generated the list of sequences rank 
ordered by their score. 

[0182] To test reproducibility, the search was repeated three times with different random number seeds and all trials 
provided essentially identical results. The second best sequence is a Val 30 to Ala mutation and lies three Iccal/mol 
above the ground state sequence. Within the top 15 sequences up to six mutations from the ground state sequence 
20 are tolerated, indicating that a variety of packing an-angements are available even for a small coiled coil. Eight se- 
quences with a range of stabilities were selected for experimental testing. Including six from the top 15 and two more 
about 15 Iccal/mol higher in energy, the 66th and 70th In the list (Table 1). 



25 



TABLE 1 



30 



35 



40 



Name 


Sequence 


Rank 


Energy 


PDA-SH" 


RMKQLEDKVEELLSKNYHLENEVARLKKLVGER 


1 


-118.1 


PDA-3A 


RMKQLEDKVEELLSKNYHLENEVARLKKLAGER 


2 


-115.3 


POA-36 


RMKQLEDKVEELLSKNYHLENEMARLKKLVGER 


5 


-112.8 


PDA-3B 


RLKQMEDKVEELLSKNYHLENEVARLKKLVGER 


6 


-112.6 


PDA.3D 


RLKQMEDKVEELLSKNYHLENEVARLKKLAGER 


13 


-109.7 


PDA-3C 


RMKQWEDKAEELLSKN YHLEN EVARLKKLVGER 


14 


-109.6 


PDA-3F 


RMKQFEDKVEELLSKNYHLENEVARLKKLVGER 


56 


-103.9 


PDA-3E 


RMKQLEDKVEELLSKNYHAENEVARLKKLVGER 


70 


-103.1 



[0163] Thirty-three residue peptides were synthesized on an Applied Biosystems Model 433A peptide synthesizer 
45 using Fmoc chemistry, HBTU activation and a modified Rinic amide resin from Novabiochem, Standard 0. 1 mmol cou- 
pling cycles were used and amino termini were acetylated. Peptides were cleaved from the resin by treating approxi- 
mately 200 mg of resin with 2 mL trifluoroacetic acid (TFA) and 100 water, 100 jiL thloanisole, 50 \iL ethahedlthlol 
and 160 mg phenol as scavengers. The peptides were isolated and purified by precipitation and repeated washing 
with cold methyl tert-butyl ether followed by reverse phase HPLC on a Vydac C8 column (25 cm by 22 mm) with a 
50 linear acetonitrile-water gradient containing 0.1% TFA. Peptides were then lyophllized and stored at -20 *C until use. 
Plasma desorption mass spectrometry found all molecular weights to be within one unit of the expected masses. 
[0164] Circular dichroism CD spectra were measured on an Aviv 62DS spectrometer at pH 7.0 in 50 mM phosphate, 
150 mM NaCl and 40 \iM peptide. A 1 mm pathlength ceil was used and the temperature was controlled by a thermo- 
electric unit. Thermal melts were perfonned in the same buffer using two degree temperature Increments with an 
55 averaging time of 1 0 s and an equilibration time of 90 s. Tn, values were derived from the ellipticity at 222 nm ([61222) 
by evaluating the minimum of the d[G]222^dT"^ versus T plot (Cantor & Schimmel, Biophysical Chemistry. New York: W. 
H. Freemant and Company, 1980). The T^^'s were reproducible to within one degree. Peptide concentrations were 
determined from the tyrosine absorbance at 275 nm (Huyghues-Despolntes, ef a/., supra). 
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i?n*5inr?'*/or*^'!f chromatography was performed with a Synchropak GPC 
1 00 column (25 cm by 4.6 mm) at pH 7.0 in 50 mM phosphate and 1 50 mM NaCI at 0 'C. GCN4-p1 and p-LI (Harburv 
L1J.Z"T.lfT, (^^f)-«^«"««<^««^'^e standards. 10,1 injections of 1 mM peptide sLion were chroma: 

S J K S e . ^"'^ ^* P^"''® concentrations were approximately 60 iiM as estimated 

from peak heights. Samples were run in triplicate. 

[0166] The designed a and d seq uences were synthesized as above using the GCN4i3l sequence for the b-e and 
e.f.g positions. Standard solid phase techniques were used and following HPLC purification, the identities of the pep- 
tides were conflmied by mass spectrometry. Circular dichroism spectroscopy (CD) was used to assay the secondav 

T^Z^^V^T^ ' "^^'1'!°^ '^^"^"^'^ P^P"^'"- "^^ ^P^<='^« °f Peptides at 1 -C and a concentration 
«I;^!„T TH M rr?* 208 and 222 nm and a maximum at 1 96 nm. which are diagnostic for a helices (data not 

^Th^" tT"^ ''^"^ ^"'^^'^ °^ ♦•'^ P^P*'*« >85% helical (approximately -28000 deg 

cm2/dmol). with the exception of PDA-3C which is 75% helical at 40 mM but increases to 90% helical at 1 70 mM (Table 



Table 2. 



CD data and calculated structural properties of the PDA peptides. 



Name 



PDA- 
3H 

PDA- 
3A 

PDA- 
SB 

PDA- 
3G 

PDA- 
3F 

PDA- 
3D 

PDA- 
3C 

PDA- 
3E 



(deg 
cm2/ 

dmol) 



33000 



30300 



28200 



57 



(kcal/ 
mol) 



AAp 
(A2) 



Vol 
(A3) 



2967 2341 1830 



118.1 



48 



47 



30700 



28800 



27800 



24100 



27500 



47 



2910 2361 1725 



115.3 



Rot I EcQ 
bonds I (kcal/ 
mol) 



28 



-234 



26 



-232 



2977 2372 1830 



112.6 



3003 2383 1878 



39 



39 



26 



24 



112.8 



3000 2336 1872 



103.9 



2920 2392 1726 



109.7 



2878 2400 



109.6 



2882 2361 



103.1 



1843 



1674 



-CG I ^vdW 

(kcal (kcal^ 
/mol) I mol) 



•308 409 



■312 400 



28 



-242 



32 



-240 



-306 379 



-309 439 



Npb 



Pb 



207 



128 



203 



128 



210 



127 



212 



128 



28 



-188 



26 



■240 



26 



-149 



24 



-179 



-302 420 



-310 370 



•304 



-309 



398 



411 



212 



128 



206 



127 



215 



129 



203 



127 



'bMC the Monte uario energy; AA„p and AAp are the changes In solve nt accLible nin-polarind polar Lrface 
areas upon foWing, respectively; E,, Is the electrostatic energy using equilibrated charges; E^g is the electrostatic 
energy using Gastelger charges; E^^w is the van der Waals energy; Vol is the side chain van der Waals volume- 
Rot bonds is the number of side chain rotatable bonds (excluding methyl rotors); Npb and Pb are the number of 
Duried non-polar and polar atoms, respectively 



[01671 The melting temperatures (Tm's) show a broad range of values (data not shown), with 6 of the 8 peptkies 
Illl^'lL temperature. Also, the T^'s were not correlated to the number of sequence dlf- 



ferences from GCN4-p1, Single amino acid changes resulted In some of the most and least stable peptides, demon" 
stratmg the Importance of specificity In sequence selection. h«hi»u«5,, aemon 

[016^ Size exclusion chromatography confimied the dimeric nature of these designed peptides. Using colled coil 
peptides of known oligomerization state as standards, the PDA peptides migrated as dimers. This result £ consistent 
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with the appearance of p-btanched residues at a positions and leucines at d positions, which have been shown pre- 
viously to favor dimerlzation over other possible oligomerization states (Harbury, at al supra) 
[0169] The characterization of the PDA peptides demonstrates the successful design of several stable dimeric helical 
coiled coils. The sequences were automatically generated in the context of the design paradigm by the simulation 
module using well-defined inputs that explicitly consider the HP patterning and sterfc specificity of protein structure 
Two dimensional nuclear magnetic resonance experiments aimed at probing the specificity of the tertiary packing are 
the focus of further studies on these peptides. Initial experiments show significant protection of amide protons from 
chemical exchange and chemical shift dispersion comparable to GCN4-p1 (unpublished results) (Oas. etal Biochem- 
istry 29:289.1 (1990)); Goodman & Kim, Biochem. 30:11615 (1991)). .era/..Diocnem 
[017(q Date analysis and design feedback A detailed analysis of the correspondence between the theoretical and 
expenmental potential surfaces, and hence an estimate of the accuracy of the simulation cost function, was enabled 
by the co^^e(^ion of experimental data. Using themial stability as a measure of design perfomiance, melting tempera- 
tures of the PDA peptides were plotted against the sequence scores found in the l\^onte Carlo search (Figure 6) The 
modest con-elatioh. 0.67, in the plot shows that while an exclusively van der Waals scoring function can screen for 
stable sequences, it does not accurately predict relative stabilities. In order to address this issue, correlations between 
calculated structural properties and T„'s were systematically examined using quantitative structure activity reiation- 
Sm 28-1133 098^^^^^ * statistical technique commonly used in structure based drug design (Hopfinger. J. Med. 

[01711 Table 2 lists various molecular properties of the PDA peptides in addition to the van der Waals based Monte 
Carlo scores and the experimentally detemilned T„'s. A wide range of properties was examined, including molecular 
mechanics components, such as electrostatic energies, and geometric measures, such as volume. The goal of QSAR 
is the generation of equations that closely approximate the experimental quantity, in this case T„. as a function of the 
calculated properties. Such equations suggest which properties can be used in an improved cost function The PDA 
analysis module employs genetic function approximation (QFA) (Rogers & Hopfinger, J. Chem. Inf. Comput. Scie 34- 
854 (1994)), a novel method to optimize QSAR equations that selects which properties are to be included and the 
relative weightings of the properties using a genetic algorithm. GFA accomplishes an efficient search of the space of 
possible equations and robustly generates a list of equations ranked by their corrBlatron to the data 
[0172] Equations are scored by lack of fit (LOF), a weighted least square error measure that resists overfitting by 
penalizing equations with more terms (Rogers & Hopfinger, supra). GFA optimizes both the length and the composition 
of the equations and, by generating a set of QSAR equations, clarifies combinations of properties that fit well and 
properties that recur in many equations. All of the top five equations that correct the simulation energy (Emc) contain 
burial of nonpolar surface area, Mnp (Table 3). uc/ 



Table 3. 


Top five QSAR equations generated by GFA with LOF, correlation coefficient and cross validation scores 


QSAR equation 


LOF 


r2 


CVr2 


"■'•^^•Emc + 0.1*AAnp - 0.73*Npb 


16.23 


.98 


.78 


-1 .78*E^^c + 0-20*AAnp ' 2.43*Rot 


23.13 


.97 


.75 


-1 ^9*Emc + 0-1 ^Mnp - 0,05*Vol 


24.57 


.97 


.36 


-1.54*BMc + 0.irM^ 


26.45 


^91 


.80 


-1 .60*E^^c + 009*AAnp - 0.12*AAp 


33.88 


.96 




J^p ana AAp are nonpolar and polar surface buried upon folding, respectively. Vo is side chain volume Npb 
IS the number of buried nonpolar atoms and Rot is the number of buried rotatable bonds 



7?J Pi'es®"** of M„p in all of the top equations, in addition to the low LOF of the QSAR containing only Eur 
and AA„p, strongly implicates nonpolar surface burial as a critical property forpredicting peptide stability. This conclusion 
IS not suipnsing given the role of the hydrophobic effect in protein energetics (Dill, Biochem. 29:71 33 (1 990)) 
[0174] Properties were calculated using BIOGRAF and the Drelding forcefield. Solvent accessible surface areas 
were calculated with the Connolly algorithm (Connolly, (1 983) (supra)) using a probe radius of 1 .4 A and a dot density 
Of 1 0 A-z Volumes were calculated as the sum of the van der Waals volumes of the side chains that were optimized 
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The number of buried polar and nonpolar heavy atoms were defined as atoms, with their attached hydrogens, that 
expose iess than 5 A2 in the surface area calculation. Electrostatic energies were calculated using a dielectric of one 
and no cutoff was set for calculation of non-bonded energies. Charge equilibration charges (Rappe & Goddard III J 
Phys. Chem. 95:3358 (1991) and Gasteiger (Gasteiger & IVIarsili, Tetrahedron 36:3219 (1980) charges were used to 
generate electrostatic energies. Charge equilibration charges were manually adjusted to provide neutral backbones 
and neutral side chains in order to prevent spurious monopole effects. The selection of properties was limited by the 
requirement that properties could not be highly correlated. Correlated properties cannot be differentiated by QSAR 
techniques and only create redundancy in the derived relations. 

[0175] Genetic function approximation (GFA) was perfomied in the CERIUS2 simulation package version 1 6 (Bio- 
sym/Moiecular Simulations. San Diego, CA). An initial population of 300 equations was generated consisting of random 
combinations of three properties. Only linearterms were used and initial coefficients were detemiined by least squares 
regression for each set of properties. Redundant equations were eliminated and 10000 generations of random cross- 
over mutations were performed. If a child had a better score than the worst equation in the population, the child replaced 
the worst equation. Also, mutation operators that add or remove terms had a 50% probability of being applied each 
generation, but these mutations were only accepted if the score was improved. No equation with greater than three 
temris was allowed. Equations were scored during evolution using the lack of fit (LOF) parameter, a scaled least square 
error (LSE) measure that penalizes equations with more temis and hence resists overfitting. LOF is defined as- 



(1 - 2C)2- 



where o is the number of ternis in the equation and M is the number of data points. Five different randomized runs 
were done and the final equation populations were pooled. Only equations containing the simulation energy Eue were 
considered which resulted in 1 08 equations ranked by their LOR 

[01761 To assessthepredfctivepoweroftheseQSARequations,aswellasthelrrobustness,crossvaiidation analysis 
was earned out. Each peptide was sequentiaHy removed from the data set and the coefficients of the equatton In 
question were refit. This new equation was then used to predict the withheld data point. When all of the data points 
had been predicted in this manner, their correlation to the measured T„"s was computed (Table 3). Only the E»c/AA 
QSAR and the Eud^^ QSAR perfomied well in cross validation. The Eyc/AA^ equation could not be expecte"d 
to fit the data as smoothly as QSAR's with three terms and hence had a lower cross validated r^. However, all other 
two tern QSAR's had LOF scores greater than 48 and cross validation conflations less than 0.55 (data not shown) 
The QSAR analysis independently predicted with no subjective bias that consideration of nonpolar and polar surface 
area burial is necessary to improve the simulation. This result is consistent with previous studies on atomic solvation 
potentials (Ssenberg, etal., (1986) (supra); Wesson, etat., Protein Sci. 1:227 (1992)). 

[0177] Further, simpler structural measures, such as number of buried atoms, that reflect underlying principles such 
as hydrophobic solvation (Chan, etal.. Science 267: 1 463 (1 995)) were not deemed as significant by the QSAR analysis 
These results justify the cost of calculating actual surface areas, though in some studies simpler potentials have been 
shown to perform well (van Gunsteren, etal. J. Mol. Biol 227:389 (1992)). 

[0178] ABpp and Mp were introduced into the simulation module to correct the cost function. Contributions to surface 
burial from rotamer/template and rotamer/rotamer contacts were calculated and used In the Interaction potential In- 
dependently counting buried surface from different rotamer pairs, whteh is necessary in DE E, leads to overestimation 
of burial because the radii of so^ent accessible surfaces are much larger than the van der Waais contact radii and 
hence can overlap greatly in a close packed protein core. To account for this discrepancy, the areas used in the QSAR 
were recalculated using the painwise area method and a new Eud^f/^ QSAR equation was generated. The ratios 
of the Emc coefficient to the AA^p and Mp coefficients are scale factors that are used in the simulation module to 
convert buried surface area into energy, i.e. atomic solvation parameters. Thermal stabilities are predicted well by this 
cost function (Figure 6B). In addition, the improved cost functionstillpredicts the naturally occurring GCN4i)1 sequence 
as the ground state. The surface area to energy scale factors, 1 6 cal/mol/Aa favoring nonpolar area burial and 86 cal/ 
mom opposing polar area burial, are similar in sign, scale and relative magnitude to solvation potential parameters 
derived from small molecule transfer data (Wesson & Eisenberg, supra). 

[0179] X repressor mutants: To demonstrate the generality of the cost function, other proteins were examined using 
the simulaton module. A library of core mutants of the DNA binding protein X repressor has been extensively charac- 
^^Z^.^^^^J'J' ^"^ C-lm & Sauer, J. Mol. Biol. 219:359 (1 991)). Template cooitilnates were taken from 

PDB file 1LMB (Beamer & Pabo, J. Mol. Biol. 227:177 (1992)). The subunit designated chain 4 in the PDB file was 
removed from the context of the rest of the structure (accompanying subunit and DNA) and using BIOGRAF explicit 
hydrogens were added. The hydrophobic residues with side chains within 5 A of the three mutation sites (V36 M40 
V47) are Y22, L31, A37, M42. L50, F51. L64. L65, 168 and L69. All of these residues are greater than 80% buried 
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except for M42 which is 65% buried and L64 which is 45% buried. A37 only has one possible rotamer and hence was 
not optimized. The other nine residues in the 5 A sphere were allowed to take any rotamer conformation of their amino 
acid ("floated"). The mutation sites were allowed any rotamer of the amino acid sequence In question. Depending on 
the mutant sequence, 5 x 1 0^6 to 7 x 10^8 conformations were possible. Rotamer energy and DEE caiculation times 

5 were 2 to 4 minutes. The combined activity score is that of Hellinga and Richards (Hellinga, et al., (1994) (supra)). 
Seventy-eight of the 126 possible combinations were generated. Also, this dataset has been used to test several 
computational schemes and can sen/e as a basis for comparing different forcefields (Lee & Levitt, Nature 362:448 
(1991); van Gunsteren & Marie, supra; Hellinga, etal., (1994) (supra)). The simulation module, using the cost function 
found by QSAR, was used to find the optimal conformation and energy for each mutant sequence. All hydrophobic 

10 residues within 5 A of the three mutation sites were also left free to be relaxed by the algorithm. This 5 A sphere 
contained 12 residues, a significantly larger problem than previous efforts (Lee & Levitt, supra; Hellinga, (1 994) (supra)), 
that were rapidly optimized by the DEE component of the simulation module. The rank correlation of the predicted 
energy to the combined activity score proposed by Hellinga and Richards is shown in Figure 7. The wlldtype has the 
lowest energy of the 126 possible sequences and the correlation is essentially equivalentto previously published results 

15 Which demonstrates that the QSAR corrected cost function is not specific for coiled coils and can model other proteins 
adequately. 

Exanple 2 

20 Automated design of the surface positions of protein helices 

[0180] GCN4-pl, a homodimeric coiled coil, was again selected as the model system because it can be readily syn- 
thesized by solid phase techniques and its helical secondary structure and dimeric tertiary organization ease charac- 
terization. The sequences of homodimeric coiled colls display a seven residue perlodte hydrophobte and polar pattern 

25 called a heptad repeat, (a-b-c-d-e-f-g) (Cohen & Parry, supra). The a and d positions are buried at the dimer interface 
and are usually hydrophobic, whereas the b, c, e, f , and g positions are solvent exposed and usually polar (Figure 5). 
Examination of the crystal structure of GCN4-p1 (O'Shea, et al.. supra) shows that the b, c, and f side chains extend 
into solvent and expose at least 55% of their surface area. In contrast, the e and g residues bury from 50 to 90% of 
their surface area by packing against the a and d residues of the opposing helix. We selected the 1 2 b, c, and t residue 

30 positions for surface sequence design: positions 3. 4. 7. 10. 11, 14, 17, 18, 21, 24, 25. and 28 using the numbering 
from PDB entry 2zta (Bernstein, et al., J. Mol. Biol. 112:535(1 977)). The remainder of the protein structure, including 
all other side chains and the backbone, was used as the template for sequence selection calculations. The symmetry 
of the dimer and lack of interactions of surface residues between the subunits allowed independent design of each 
subunlt, thereby significantly reducing the size of the sequence optimization problem. 

35 [01 811 All possible sequences of hydrophilic amino acids (D, E. N, Q, K, R, S, T, A, and H) for the 1 2 surface positions 
were screened by our design algorithm. The torsional flexibility of the amino acid side chains was accounted for by 
considering a discrete set of all allowed conformers of each side chain, called rotamers (Ponder, etal., (1987( (supra); 
Dunbrack, et aL, Struc. Biol. Vol. 1 (5):334-340 (1 994)). Optimizing the 1 2 b, c, and f positions each with 1 0 possible 
amino acids results In 1 0^2 possible sequences which corresponds to -1 028 rotamer sequences when using the Dun- 

40 brack and Karplus backbone-dependent rotamer library. The immense search problem presented by rotamer sequence 
optimization Is overcome by application of the Dead-End Elimination (DEE) theorem (Desmet, etaf,, (1992( (supra); 
Desmet, etaL, (1994) (supra); Goldstein, (1994) (supra)). Our implementation of the DEE theorem extends its utility 
to sequence design and rapidly finds the globally optimal sequence in its optimal conformation. 
[0182] We examined three potential-energy functions for their effectiveness In scoring surface sequences. Each 

45 candidate scoring function was used to design the b, c, and f positions of the model coiled coil and the resulting peptide 
was synthesized and characterized to assess design perfonnance. A hydrogen-bond potential was used to check if 
predicted hydrogen bonds can contribute to designed protein stability, as expected from studies of hydrogen bonding 
in proteins and peptides (Stickle, et a/., supra; Huyghues-Despointes, ef a/., supra). Optimizing sequences for hydrogen 
bonding, however, often buries polar protons that are not involved in hydrogen bonds. This uncompensated loss of 

50 potential hydrogen-bond donors to water prompted examination of a second scoring scheme consisting of a hydrogen- 
bond potential in conjunction with a penalty for burial of polar protons (Eisenberg, (1986) (supra)). We tested a third 
scoring scheme which augments the hydrogen bond potential with the empirically derived helix propensities of Baldwin 
and cowortters (Chakrabartty, et al„ supra). Although the physical basis of helix propensities is unclear, they can have 
a significant effect on protein stability and can potentially be used to improve protein designs (O'Neil & DeGrado, 1990; 

55 Zhang, ef a/., Biochem. 30:2012(1991); Blaber, etal, Science 260:1637 (1993); O'shea, ef a/., 1993; Villegas, etai. 
Folding and Design 1 :29 (1 996)). A van der Waals potential was used in all cases to account for packing Interactions 
and excluded volume. 

[01 83] Several other sequences for the b, c and f positions were also synthesized and characterized to help discern 
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the relative Importance of the hydrogen-bonding and helix-propensity potentials. The sequence designed with the 
hydrogen-bond potential was randomly scrambled, thereby disrupting the designed interactions but not changing the 
helix propensity of the sequence. Also, the sequence with the maximum possible helix propensity, all posiUons set to 
alanine, was made. Finally, to sen/e as undesigned controls, the naturally occurring GCN4-p1 sequence and a se- 
quence randomly selected from the hydrophllic amino acid set were synthesized and studied. 

and DEE: The protein structure was modeled on the backbone co- 
ordinates of GCN4-P1 , PDB record 2zta (Bernstein, et al., supra; O'Shea, etal.. supra). Atoms of all side chains not 
optimized were left in their crystallographically detemilned positions. The program BIOGRAF (IWolecular Simulations 
incorporated. San Diego. CA) was used to generate explicit hydrogens on the structure which was then conjugate 
gradient m nimized for 50 steps using the DREIDING forcefield (IVIayo, et al.. 1 990, supia). The symmetry of the dimer 
and lack of interactions of surface residues between the subunits allowed independent design of each subunit All 
computations were done using the first monomer to appear in 2zta (chain A). A backbone-dependent rotamer library 
was used (Dunbrack. et al. (1993) (supra)). Cg angles that were undetennined from the database statistics were as- 
signed the following values: Arg, -60°, 60°, and 180°; Gin. -120°, -60°. 0°, 60°. 120°. and 180°; Glu. 0°. 60°. and 120- 
Lys, -60°, 60°, and 180°. C4 angles that were undetermined from the database statistfcs were assigned the followina 
values: Aig. -1 20°, -eO', 60°. 1 20°. and 1 80°; Lys. -60». 60°. and 1 B0». Rotamers with combinations of c, and c^ that 
resulted in sequential gVg- or g-/g+ angles were eliminated. Uncharged His rotamers were used. A Lennard-Jones 
12-6 potential with van der Waals radii scaled by 0.9 (Dahiyat. etal.. First fully automatic design of a protein achieved 
by Caltech scientists, new press release (1 997) was used for van der Waals interactions. The hydrogen bond potential 
consistedof a distance-dependent term and an angle-dependent temi, as depicted in Equation 9, above. This hydrogen 
bond potential is based on the potential used in DREIDING, with more restrictive angle-dependent terms to limit the 
occurrence of unfavorable hydrogen bond geometries. The angle term varies depending on the hybridization state of 
the donor and acceptor, as shown in Equations 10 to 13. above. 

[01851 In Equations 10-13, 6 is the donor-hydrogen-acceptor angle, i|> is the hydrogen-acceptor-base angle (the base 
is the atom attached to the acceptor, for example the carbonyi carbon is the base for a caibonyl oxygen acceptor) and 
cp IS the angle between the nonnals of the planes defined by the six atoms attached to the spZ centers (the supplement 
of <p IS used when <p is less than 90°). The hydrogen-bond functfon is only evaluated when 2.6 A < /? <3 2 A (|. > 90° 
f - 1 09.5° < 90° for the sp3 donor - sp3 acceptor case, and, <f > 90» for the sp3 donor - spz acceptor case; no switching 
funrtions were used. Template donors and acceptors that were involved in template-template hydrogen bonds were 
not included in the donor and acceptor lists. For the purpose of exclusion, a template-template hydrogen bond was 
considered to exist when 2.5 A S fl s 3.3 A and 6 2 1 35°. A penalty of 2 kcalAnol for polar hydrogen burial, when used 
was only applied to buried polar hydrogens not Involved In hydrogen bonds, where a hydrogen bond was considered 
to exist when Ehb was less than -2 kcal/mol. This penalty was not applied to template hydrogens. The hydrogen-bond 
potential was also supplemented with a weak coulombic teim that included a distance-dependent dielectric constant 
of 40R, where R is the interatomic distance. Partial atomic charges were only applied to polar functional groups A net 
formal charge of +1 was used for Arg and Lys and a net formal charge of -1 was used for Asp and Glu Energies 
associated with a-helical propensities were cafculated using equation 14. above. In Equation 14. is the energy of 
a-helical propensity, AG°^ is the standard free energy of helix propagation of the amino add, and AG'^ is the standard 
ree energy of helix propagation of alanine used as a standard, and N« is the propensity scale factor which was set 
to 3.0. TTiis potentaal was selected in order to scale the propensity energies to a similar range as the other terms in the 
scoring function. The DEE optimization followed the methods of our previous work (Dahiyat. etal., (1996) (supra)) 
[0186] Calculations were performed on either a 12 processor, R1 0000-based Silfcon Graphics Power Challenge or 
a 512 node Intel Delta. 

Ln!f?L. ^^^^^^ synthesis and purification and CD analysis was as in Example 1. NMR samples were prepared in 
90/10 H2O/D2O and 50 mM sodium phosphate buffer at pH 7.0. Spectra were acquired on a Varian Unityplus 600 MHz 
spectrometer at 25 °G. 32 transients were acquired with 1 .5 seconds of solvent presaturation used for water suppres- 
sion. Samples were ~1 mlW. Size exclusion chromatography was performed with a PolyLC hydroxyethyi A column (20 
cm x 9 mm) at pH 7.0 in 50 mM phosphate and 150 mM NaCI at 0 °C. GCN4-p1 and p-Li (Harbury, etal. supra) were 
used as size standards for dimer and tetramer. respectively. 5 jil injections of ~1 mM peptide solution were chroma- 
tographed at 0.50 ml/min and monitored at 214 nm. Samples were mn in triplicate. 
[0188] The surface sequences of all of the peptides examined in this study are shovm In Table 4. 
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Table 4. Sequences and properties of the synthesized peptides 





uesign rneinoa 


Surface Sequence 






N 






bcf bcf bcf bcf 




rc) 

3.831 


(kcBUinri) 

2 


6CN4-n1 


nunc 


KQD EES YHN ARK 


57 


6A 


HB 


EKD RER RRE RRE 


71 


2.193 


2 


6B 


HB + PB 


EKQ KER ERE ERQ 


72 


2.868 


2 


6C 


HB + HP 


ARAAAARRRARA 


69 


-2.041 


2 


60 


scrambled HB 


REE RRR EDR KRE 


71 


2.193 


2 


6E 

* ■ 


random polar 


NTR AKS ANH NTQ 


15 


4.954 


2 


6F 


poly(Ala) 


AAA AAA AAA AAA 


73 


-3.096 


4 



teXTi ^94^ A»lrX°„?f H ''^t ^"^"^^ °' P Wtion Of the 1 2 b. c. and f positions (Chakra- 
^T^ltolT^T^ "^''"'^ '"^"''^ Penatty (P8). 

SZLSrr ^^^'9"^ "ft" « hydrogen-bond potential, has a preponderance of Arg and Glu residues that 
are predicted to form numerous hydrogen bonds to each other These long chain amino acids are favoTeXiuLi 
W^eycan extend acrass turns Of thehelixto inter 

of «,e scrambled 6A sequence. 6D, was found with DEE, far fewer hydrogen bonding lntera^tJ,nl^ e p^^^^^^^ 

poten irteijdirr. b^ '""'^T ' p^'^^'-y^os- ""^'^i p-a'^ in addit!!:::torhXgen s 

potential, is still dominated by long residues such as Lys. Glu and Gin but has fewer Arq Because Am ha<s mo«. n«ilr 
hydrogens than the other amino acids, it more often buries nonhydrogen-bondXrlnslSd ther^^^^^ disfTid 

efTsuor^rrA^^^^^^^ 

etal.. supra). The Arg residues fomi hydrogen bonds with Glu residues at nearby e and g positions The randZ 
hydrophil^sequence. 6E. possesses no hydrogen bonds and scores very poorly w JaH of tJeXSnLions s^^^ 
[019 ] The secondary structures and thermal stabilities of the peptides were assessed by drcular dSsm Tcm 

and 222 nm. except or the random surface sequence peptide 6E. 6E has a spectrum suggestive of a mixture of « 
It^rT '^r?"'" ~" ' "'hile all the other peptides are greater man 90% heSL^ 

Th 'ran?e T '':"gZtt::!'"'°' '"^""^ '^'"''^'"^ ^-"^^ °^ 'ieZZZT.^ 2':^c 

nigner than the T„ of GCN4-p1 , with the exception of 6E which has a T„ of 1 5 'C CD soectra takan hpfnr» ahh .fto. 

s^cihir ^^^^ ^"^'^ -'"^^ ---^-p^ • a -^om hyXhS: sCncrrg^; 

[0192] Size exclusion chromatography (SEC) showed that all the peptides were dimere exceot for 6F the nil a,o 

?ettnce Stt ; "tn (^^^^'^^ ^'^'^'^ion, nuciar magS 

Shown) ^ ^* ~^ '"^ '^''^'"'^ '^'«P«'«to" «"""ar to GCN4-P1 (data not 

Sn?hL I?'"?^®^ ""^^ ^ hydrogen-bond potential, melts at 71 'C versus 57 -C for GCN4-p1 demonstrat- 

u'rr nTcS^^^^^^^^ produce structures that a. markedly more stable ^hannaZ^ 

«mr:. r ^ " ^^^'"'"y ® probably not due to improved hydrogen bonding since 6D which has the 

mo^n^::::':::^^^^^^ « ^'^^-^ -^-i and set of predS 

Eml" ? f*"" the Increased stability of these sequences relative to GCN4-p1 is their higher 

helix p«,pensrty. The long polar residues selected by the hydrogen bond potential. Lys. Glu. Arg and cln. Ire lo 
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among the best helix formers (Chakrabartty, et al.. supra). Since the effect of helix propensity is not as dependent on 
sequence position as that of hydrogen bonding, especially far from the helix ends, little effect would be expected from 

It«ri;Sl LT"'"'! ^[f^ ^ '''"^'^ '"^^"'^ P~P^"«'^ tt^^ sequences, the sum of the 

8 SlSmilihtT^^^ hel X propagation (SAG-) (Chakrabartty. et al.. supra), corresponds to the peptides' thermal 

! S S 1^.°"^ ^® ^^^^'^"'^ P^P*'^^ not quantitatively correlated to the increased 

staDility of these coiled coils. 

[0195] Peptide 6C was designed with helix propensity as part of the scoring function and it has a SAG" of -2 041 

Sl.r'^"^ ^'"^ "^^^ ^^^''■P^ • °' ^9 °C ^"gf'*'y than 6A and 6B, in spite of 6Cs 
^^o^ Pi-^Pf sity. Similarly, 6F has the highest helix propensity possible with an all Ala sequence and a EAG- of 

Huf n %r I J \°[ °^ '"^'^inally higher than that of 6A or 6B. 6F also migrates as a tetramer 

««Tl!f '-^^ ^ • ""^'^ P°'^^'^'^) ^""^^"^ ^"^^^ « '«^9« hydrophobic patch that could mediate 

assoaabon. Though the results for 6C and 6F support the conclusion that helix propensity is important for surface 

S";h ^T?"^ ^^r^^^^ ""^'"S P^°P«"«»y exclusively. Increasing propensity does not necessariV 

l^^lZ^^ ^Ic ^^u °" ^ P^'^'^P* other factors are being effected unfavorably. Also, as is 

evident from 6F. changes in the tertiary structure of the protein can occur. 

SJwSL '^I^^fterlzatlon of these peptides clearly shows that surface residues have a dramatic impact on the 
stab Irty of a-helical coiled coils. The wide range of stabilities displayed by the different surface designs is notable, with 
ro^n^M ,f P'^'** ^^^^^^ hydrophilic sequence (T, 15 'C) and the designed sequences (T„ 

^LIuJL^'l l^n "1?"^ *«"t*t^' 0" other proteins that demonstrnted the importance of solvent exposed 
Mac 1 ^ 1991; Minor, etal.. (1994) (supra); Smith, ef a/.. Science 270:980-982 

It fL ''^"^ ^^''^ significantly higher T„'s than the wildtype GCN4-p1 sequlS^iilemonstratlng 

Inlo f l"^ "^^^ *° '"^^"^^ ^^^^ P~*^'" ^««'9" (O'sf'ea- e»a/., supra). Though helix propensi^ 

h^^!, ^ ^ more important than hydrogen bonding in stabilizing the designed coiled coils, hydrogen bonding could 

be important in the design and stabilization of other types of secondary structure 



Example 3 



Design of a protein containing core, surface and boundary residues using van der Waals, H-bonding. secondary 
structure and solvation scoring functions 

fi"^« """"'^^'y and surface residue work was combined. In selecting a motif to test the 

integration of ourdesign methodologies, we sought a protein fold that would be small enough to be both computationally 
and expenmeritaily tractable, yet large enough to fomi an independently folded structure in the absence of disulfide 
M Qo A t W« '^"^^ PP« ""oW typified by the zinc finger DNA binding module (Pavletich. et al. 

(1 991 ) (supra)). Though it consists of less than 30 residues, this motif contains sheet, helix, and turn structures Further 
recent woric by imperlafi and coworkers who designed a 23 residue peptide, containing an unusual amino add (D- 
,h!2i!"^ fhT r ^""'"u ■■'0-Phenanthrol-2-yl)-L-alanlne), thattakes this stmcture has demonstrated 

pnS Si , absence of metal ions (Struthers. etal.. 1996a). The Brookhaven Protein Data Bank 

Si!^r nK.f ^^^^ ^t™°t"^^^ °t PP« and the second zinc finger 

modu e of the DNA binding protein Zi1268 (PDB code Izaa) was selected as our design template (Pavletich, Jal 
( 991) (supra)). The backbone of the second module aligns very closely with the other two zinc fingers in Zif268 and 
with zinc fingers in other proteins and is therefore representative of this fold class. 28 residues were taken from the 
cn^stal structure starting at lysine 33 in the numbering of PDB entry Izaa which corresponds to ourposition 1 . The first 
12 residues comprise the p sheet with a tight turn at the 6* and 7^^ positions. Two rescues connect the sheet to the 
helix, which extends through position 26 and is capped by the last two residues 

il""^!' *° positions in the template structure into core, surface or boundary classes, the 

2r tin r «" ^"'^ '^''''^ ^''amined. The small size of this mttf 

? ^ ^ °' '^^^ ^^^'9"^'^ unambiguously to the core while six residues 

(positions 3 1 2, 1 8, 21 , 22, and 25) were classified as boundary. Three of these residues are from the sheet (positions 
1 T J™"" (positions 1 8, 21 , 22, and 25). One of the zinc binding residues of Zi^es is in 

the core and two are in the boundary, but the fourth, position 8, has a Ca-Cp vector directed away from the protein's 
geometric center and is therefore classified as a surface position. The othersurface positions considered by the design 
Six tr^h: r. • ' '"^ ''l'''^^' an«^ 23 from the helbc and 14. 27, and 28 whfch c«p he 

J!ln K H ^'P*^ ^^^^ t"^"«' 'lad iTOguiar backbone dihedrals or were 

partially buned. were not included In the sequence selection for this initial study. As in our previous studies, the amino 

w!I V "'"■^ o ^"""3 '^' V. F. Y. and W; the amino acids considered 

at me surface positions were A. S. T. H. D. N. E. Q. K. and R; and the combined core and surface amino acid sets (?6 
amino acids) were considered at the boundary positions. ' ^ 
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S GSSfositiLltlrEf'^T °' 'T'^' "^'^ ''"'^"9 ^^''"^"''^ ^^'««''>"- The algorithm first 

selects Gly for all positions with ^ angles greater than 0« in orderto minimize backbone strain (residues 9 and 27) The 

theTZ'tS ""^"^^^ ^° optimized separately to speed the caTc Z o7e se 

loo dnl 4 7 P"^'"*'" « ^^-'^"^ '^'^^'^^ 1 -2 X 10« possible amino acid sequelcrcorre 

Z^l^lt^ jr ' T>lf other set contained the remaining 1 0 surface residues which had 1 Qio 

S mi ?Ht '"^ ' ''^ "^^ a~"P« interact strongly With each 

a 00^^^;^^ t •''♦''"'"'^ optimizations mutually independent, though there are strong interactions iiZ S 
cZinafes ' "on-optlmized positions in the template Lt to the crystallSgrap^^ 

P200J The optinial sequences found from the two calculations were combined and are shown in Fiaure 8 alianed 

iTonf ,1 if^ ""I'T"' "^"P'"^ ^'^'^ The calculated seven co^and boundS 

pos t ons fom, a well^,acked buried cluster. The Phe side chains selected by the algorithm at the Tc bindTna Hfe 

toi"n; tI'T'k''^ T'''""^ ^ ^ ''-'^ '"''^ Lys at 8 is garter tha 60% ^^^^ 

side chains in an arrangement similar to Zif268. The calculated optimal configuration buried -830 S o nonS* 1 
surface area With Phe 12 (96-^ buried) and Leo 18 (88% buried) anchoring the cluster On thfhelbc surface °he 
algonthm pos tions Asn14 as a helix N-cap with a hydrogen bond between Its side-chain ca*onyl o^g^^^^^^^^^ he 

tho^S??' T"",^' '^'^^S^^ °" f'^'" ^0"" three pairs JSgen bon^ 

nZt ^"-^^ «PPe«^«l to be less Important than the ove^Xix 

propensity o the sequence^ Positions 4 and 11 on the exposed sheet surface were selected to be m one of he be^ 
P-sheetfomilngresidues (Wm& Berg. 1993; Minor, eta,., (1994) (supra); Smith, eta,., (1996) (suprli) 
[0201] Combining the 20 designed positions with the Zif268 amino acids at the remaining 8 sVes results in a oet,tld« 
wrth overall 390^ (11/28) homology to Zif268. which reducesto 15% (3/20) homology wZ oS^ the Sg^d positions 

CenterforBlotechnologylnfomiationfindsweakhomology. less than 40%.toseveralzin^ 

Of other unrelated proteins. None of the alignments had significance values less than 0.2eX ob^c^^^^^^^^ 

was°d4gned " ' "^""'^ ""'^ ""'""'"^^ ProteinranS^Tz^c 

S .^P*'""*"*"' characterization: The far UV circular dichroism (CD) spectrum of the designed molecule 

therSi, ' '"T"'" ' ""^ "^'"'"'^ at 21 8 nm and 208 nm. which is indicative of a folded 8?™^^ The 
themial melt is weakly cooperative, with an inflecUon point at 39 "C, and is completelv reversible The b^«H mi i! 

mo«o, ^ ^ ^' ^tfuthers.efa/.. J.Am. Chem.Soc. 118:3073 (1996b)) 

[0203J Sedimentation equilibrium studies at 100 ^^M and both 7 'C and 25 "C give a molecular mass of 34Qn in nnoH 

HM. however; the data do not fit well to an ideal single species model. When the data were fit to a moS^r cfimer 
e ramer model, dissociation constants of 0.5 - 1.5 mM for monomer-to-dimer and gZ^r h^^Z f^^r^^ 
tetramer were found, though the interaction was too weak to accurately measure these va ueroiff^ln 0^^^ 

^oriiT:?^"*'"'.? P"'^* """"^"^^ «f ' 995) agreed with the se Jmen aTon SsuS a 

100 (UVI pdaSd has a diffusion coefficient close to that of a monomeric zinc f inger control, while at T.5 mM thrSsior^ 

cm^eate ULTIT*^ Z^^^""^ ^* '"^ ^"^ ^ °° ^^-^ '^^"tical with 5 of the Ha-HN 
thof^St? M PP"" ^"'^ "^^^ °f crosspeaks remaining unchanged. These data indicate 

peptiKt^ctr' ' concentration, but this association has essentially no Te^tZ 

Sf« 1 Jm r*^"^" °' P**^^** *"^P^^^' suggesting that the pmtein is folded and well-ordered 

IJd 2. ?''^"."Lm °' ""^"^^ 'P"'^*^""' well-resolved with no overlapping resonances^^F g^^^^^^^ 
and all of the Ha and HN resonances have been assigned. NMR data were collected on a Varian Unit^lus 600 MHz 

»in^ •«? . 2° 50 mM sodium phosphate at pH 5.0. Sample pH was adjusted usina a alass 

sSl ^ 7 ^''^ """^^ ''^ °" "^^^^"^^'^ PH- A" spectra for Li^ents wTre colS at 7 •? 
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TOCSY spectra were recorded with an isotropic mixing time of 80 ms. In TOCSY and DQF-COSY spectra water sup- 
pression was achieved by presaturatlon during a relaxation delay of 1 .5 and 2.0 s, respectively. Water suppression in 
the NOESY spectra was accomplished with the WATERGATE pulse sequence (Piotto. et al.. 1992). Cheniical snins 
were referenced to the HOD resonance. Spectra ware zero-filled In both F2 and F1 and apodized with a shitted gaussian 
5 in F2 and a cosine bell in F1 (NOESY and TOCSY) or a 30° shifted sine bell in F2 and a shifted gaussian in F1 

[M05i|^°w2er.sLED experiments (Altieri. etal., 1995) were run at 25 -C at 1.5 mM. 400 nM and 100 in 99.9% 
D,0 with 50 mM sodium phosphate at pH 5.0. Axial gradient field strength was varied from 3.26 to 53.1 G/cm and a 
diflusion time of 50 ms was used. Spectra were processed with 2 Hz line broadening and integrals of the a">matic and 

10 high field aliphatic protons were calculated and fit to an equation relating resonance amplitude to gradient strength in 
order to extract diffusion coefficients (Altleil. ef a/., 1 995). Diffusion coefficients were 1 .48 x 10-7. 1 .62 x 10- and 1 73 
X 10-7 cm2/s at 1 .5 mM, 400 jxM and 100 pM. respectively. The diffusion coefficient for the zinc finger monomer control 
was 1. 72 x 1 0-7 cm2/s and for protein Gbl was 1.49x1 0-7 cm2/s. uMUMMrxc. 
[0206] All unambiguous sequential and medium-range NOEs are shown in Figure 9A. Ha-HN and/or HN-HN NOEs 

15 were found for all pairs of residues except R6-i7 and K1 6-E1 7. both of which have Regenerate HN chemical shifts and 
P2-Y3 which have degenerate Ha chemical shifts. An NOE is present, however, from a P2 H6 to the Y3 HI^ analogous 
to sequential HN-HN connections. Also, strong K1 Ha to P2 H5 NOEs are present and allowed completion of the 
resonance assignments. . 

[02071 The structure of pdaSd was detemilned using 354 NOE restraints (1 2.6 restraints per residue) that were non- 
20 redundant with covalent structure. An ensemble of 32 structures (data not shown) was obtained using X-PLOR 
(Bainger. 1992) with standard protocols for hybrid distance geometry-simulated annealing. The structures in the en- 
semble had good covalent geometry and no NOE restraint violations greater than 0.3 A. As shown in Table 5, the 
backbone was well defined with a root-mean-square (rms) deviation from the mean of 0.55 A when the d'sordered 
temilnl (residues 1 , 2, 27, and 28) were excluded. The mis deviation forthe backbone (3-26) plus the buned side chains 
25 (residues 3. 5, 7. 1 2. 1 8, 21 , 22, and 25) was 1 .05 A. 
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Table 5. 



5 


NMR structure determination of pda8d: distance restraints, structural statistics, atomic root-mean-square(fTT)s) 
deviations, and comparison to the design target. <SA> are the 32 simulated annealing structures, SA is the average 
structure and SD is the standard deviation. The design target is the backbone of Zif26B. 




Distance restraints 


10 


Intraresidue 
Sequential 
Short range (ll-jl = 2-6 residues) 
Long range (li-jl > 5 residues) 
Total 


148 
94 
78 
34 

354 




Structural statistics 


IS 




<SA> + SD 


20 


Rms deviation from distance restraints (A) 
Rms deviation from idealized geometry (A) 

Bonds (A) 
Angles (degrees) 
Impropers (degrees) 


0.049 ±.004 

0.0051 ± 0.0004 
0.76 + 0.04 
0.56 ± 0.04 




Atomic rms deviations (A)* 






<SA> vs. SA ± SD 


25 


Bacl<bone 
Bacicbone -h nonpolar side chains 
Heavy atoms 


0.65 ± 0.03 
1.05 ±0.06 

1.25 + 0.04 




Atomic mis deviations between pdaSd and the design target (A)* 


30 




SA vs, target 




Backbone 
Heavy atoms 


1.04 

2.15 



•Atomic rms deviations are for residues 3 to 26, Inclusive. The termini, residues 1, 2. 27, and 28. were Wghl/ disordered and had very few non- 
sequential or non-intiaresidue contacts. 



[0208] The NIVIR solution structure of pdaSd shows that It folds into a bba motif with well-defined secondary structure 
elements and tertiary organization which match the design target. A direct comparison of the design template, the 
backbone of the second zinc finger of Zif268, to the pdaBd solution structure highlights their similarity (data not shown). 
40 Alignment of the pdaSd backbone to the design target is excellent, with an atomic rms deviation of 1 .04 A (Table 5). 
PdaBd and the design target correspond throughout their entire structures, including the turns connecting the secondary 
structure elements. 

[0209] In conclusion, the experimental characterization of pdaSd shows that it is folded and well-ordered with a 
weakly cooperative themnal transition, and that its structure is an excellent match to the design target. To our knowledge, 
45 pdaSd is the shortest sequence of naturally occun-ing amino acids that folds to a unique structure without metal binding, 
oligomerization or disulfide bond formation (McKnight. et a/.. Nature Stmc. Biol. 4:1 80 (1 996)). The successful design 
of pdaSd supports the use of objective, quantitative sequence selection algorithms for protein design. This robustness 
suggests that the program can be used to design sequences for de novo backbones. 

50 Example 4 

Protein design using a scaled van der Waals scoring function In the core region 

[0210] An ideal model system to study core packing is the p1 Immunoglobulin-binding domain of streptococcal protein 
55 G (Gpi) (Gronenborn, efa/., Science 253:657 (1991); Alexander, etal., Biochem. 31 : 3697 (1 992); Barchi, efa/., Protein 
Sci. 3:15 (1994); Gallagher, efa/., 1994; Kuszewski, efa/.. 1994; Orban, efa/., 1995). Its small size, 56 residues, 
renders computations and experiments tractable. Perhaps most critical for a core packing study, Gpi contains no 
disulfide bonds and does not require a cofactor or metal ion to fold. Further, Gpi contains sheet, helix andturn structures 
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DeJollTl!I'.Vfh''?''^''''^"°^''^^ ''^'""^ P"""'"^ '"""'^ '^'^'^ "''"^ °^ ^^'"^ This lack of 

sS oS [ T r ^ ""'^'""'^^ °' ^^^■«'y necessitates the use of an objective 

side-chain selection program to examine packing effects ^i^uvs 

Tril l?"„ni^?"'°"! K '^'^ s'^'^chain solvent accessible 

rlm.!,^, ° ^ '"'^ ^^"'^ °^ considered buried Eleven residues 

TIT^ZT'- """"" T ^ ^P"''"""* ^^3' ^2 54). three from the helix (positions 2^ 

mmainir /" ^"/T'^' ^P"^'""" ^S). These positions fomi a contiguous core. The 

remainder of the protein structure. Including all other side chains and the backbone, was used as the template for 
sequence selection calculations at the eleven core positions xempiate tor 

fS (A^'lTf pT/rw^r""*' consisting of alanine, valine, leucine, isoleucine. phenylalanine, tyrosine or tryp- 
tophan (A, y, L, . F, Y or W) were considered. Our rotamer library was similar to that used by Desmet and coworkera 
(Desniet. ef a/.. (1992) (supra)). Optimizing the sequence of the core of Gb1 with 21 7 possible hydraXblc^^le« 
llJZT ^' ^e^^ences. Our scoring function consisted ofTocr^^^^^^ 

Waa s radii of all atoms in the simulation were scaled by a factor a (Eqn. 3) to change the importance of packing effects 
Tnd hTnln ""V T ^ '''^ 'J-n-^ with various SIS 

p^n STpotiL'^^^^ P™^^'"^' ^ "9°^°"^ ^'"''y 0^ P^-'WnS effectsT 

[02131 The protein structure was modeled on the backbone coordinates of Gpl , PDB record Ipga (Bernstein etal 

T"'' ^ °^ ^" '^'^ '^^^'"^ °P^''"'^^<^ w^^« their oystallogShteairdilied 

posrtions. The program BIOGRAF (IWolecular Simulations Incorporated. San Diego. CA) was used to geneSe Zte! 

f!nJL w 1?^^^^^ ""^'y' °P*""'^^'^" ^""^ '^'>"*« as outlined at^ve A 

ardte^^Jh ?^ ""^"'^ interactions, with atomic radii scaled for the various cases 

as discussed herein. The Richards definrtlon of solvent-accessible surface area (Lee & Rtohards. supra) was used and 

fZ' "^"^'"^^^l*"^ «'9«"thm (Connolly. (1 983) (supra)). An atomic solvatton pa^mTer TriJed 

cZlI'irf f ""'"^'^ "'^'^ '^"^^ ^y^^^P*^"""^ ''""^l ^^'f to penalize soLnt exposura To 

cateulate side-chain nonpolar exposure In our optimization framework, we first consider the total hydrophobe area 

the sum of the areas buned m pairwise rotamer/rotamer contacts. 

[02141 Global optimum sequences for various values of the radius scaling factor a were found using the Dead-End 

faSr S^S'""/'"'?' '''"'"^''^ '''' —Pending proteins, are named^the Lius sSe 

factor used m their design. For example, the sequence designed with a radius scale factor of a = 0.90 Is called o90. 

Table 6. 
Gpl sequence 




[02151 in Table 6, the Gpi sequence and position numbers are shown at the top. vol is the fracton of core side-chain 
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volume relative to the Gpi sequence. A vertical bar Indicates identity with the Gp1 sequence. 
[0216] a1 00 was designed with a = 1 .0 and hence sen/es as a baseline for full Incotporation of steric effects. The 
o100 sequence is very similar to the core sequence of Gb1 (Table 6) even though no Infomnation about the naturally 
occurring sequence was used in the side-chain selection algorithm. Variation of a from 0.90 to 1 .05 caused little change 
in the optimal sequence, demonstrating the algorithm's robustness to minor parameter perturbations. Further, the pack- 
ing arrangements predicted with a = 0.90 - 1 .05 closely match Gpi with average x angle differences of only 4' from 
the crystal structure. The high identity and conformational similarity to Gp1 imply that, when packing constraints are 
used backbone conformation strongly determines a single family of well packed core designs. Nevertheless, the con- 
straiiits on core packing were being modulated by a as demonstrated by Monte Carlo searches for other low energy 
sequences. Several alternate sequences and packing arrangements are in the twenty best sequences found by the 
Monte Carlo procedure when a = 0.90. These altemate sequences score much worse when a = 0.95, and when o = 
1 .0 or 1 .05 only strictly consen/ative packing geometries have low energies. Therefore, o = 1 .05 and o = 0.90 define 
the high and low ends, respectively, of a range where packing specifteity dominates sequence design. 
[0217] Fora < 0 90 the role of packing is reduced enough to let the hydrophobic surface potential begin to dominate, 
thereby increasing the size of the residues selected forthe core (Table 6). A significant change in the optimal sequence 
appears between a = 0.90 and 0.B5 with both a85 and aSO containing three additional mutations relative to a90. Also, 
a85 and aBO have a 1 5% increase in total side-chain volume relative to Gb 1 . As a drops below 0.80 an additional 1 0% 
increase in side-chain volume and numerous mutations occur, showing that packing constraints have been over- 
whelmed by the drive to bury nonpolar surface. Though the jumps in volume and shifts in packing anrangement appear 
to occur suddenly for the optimal sequences, examination of the suboptimal low energy sequences by Monte Carlo 
sampling demonstrates that the changes are not abrupt. For example, the a85 optimal sequence is the 1 1«h best se- 
quence when a = 0.90, and similarly, the a90 optimal sequence is the 9th best sequence when a = 0.85. 
[021 81 For a > 1 05 atomte van der Waals repulsions are so severe that most amino adds cannot find any allowed 
packing arrangements, resulting in the selection of alanine for many positions. This stringency is likely an artifact of 
the large atomic radii and does not reflect increased packing specificity accurately. Rather, a = 1 .06 is the upper limit 
forthe usable range of van der Waals scales within our modeling framework. 

[021 9] Experimental characterization of core designs. Variation of the van der Waals scale factor a results in four 
regimes of packing specificity, regime 1 where 0.9 < a S 1 .05 and packing constraints dominate the sequence selection; 
regime 2 where 0.8 S a < 0.9 and the hydrophobic solvation potential begins to compete with packing forces; regime 
3 where a < 0 8 and hydrophobic solvation dominates the design; and. regime 4 where a > 1.05 and van der Waals 
repulsions appearto be too severe to allow meaningful sequence selection. Sequences that are optimal designs were 
selected from each of the regimes for synthesis and characterization. They are a 90 from regime 1 , a 85 from regime 
2 o 70 from regime 3 and a 107 from regime 4. For each of these sequences, the calculated ammo acid identities of 
ttie eleven core positions are shown in Table 6; the remainder of the protein sequence matches Gp1 . The goal was to 
study the relation between the degree of packing specificity used in the core design and the extent of native-like char- 
acter in the resulting proteins. ^ . ^ u ,u 
[0220] Peptlcte synthesis and purification. With the exception of the eleven core positions designed by the se- 
quence selection algorithm, the sequences synthesized match Protein Data Bank entry Ipga. Peptides were synthe- 
sized using standard Fmoc chemistry, and were purified by reveree-phase HPLC. Matrix assisted laser desorption 
mass spectrometry found all molecular weights to be within one unit of the expected masses. 
[0221] CD and fluorescence spectroscopy and size exclusion chromatography. The solution conditions for all 
experiments were 50 mM sodium phosphate buffer at pH 5.5 and 25 "C unless noted. Circular dichroism spectra were 
acquired on an Aviv 62DS spectrometer equipped with a thermoelectric unit. Peptide concentration was approximately 
20 uM Thermal melte were monitored at 218 nm using 2' increments with an equilibration time of 120 s. T„'s were 
defined as the maxima of the derivative of the melting curve. Reversibility for each of the proteins was confimied by 
comparing room temperature CD spectra from before and after heating. Guanidinium chtoride denaturation measure- 
ments followed published methods (Pace, Methods. Enzymol. 131:266 (1986)). Protein concentrattons were deter- 
mined by UV spectrophotometry. Fluorescence experiments were performed on a Hitachi F^SOO in a 1 cm pathlength 
cell Both peptide and ANS concentrations were 50 jiM. The excitation wavelength was 370 nm and emission was 
monitored from 400 to 600 nm. Size exclusion chromatography was perfomied with a PoiyLC hydroxyethyl A column 
at pH 5 5 in 50 mM sodium phosphate at 0 °C. Ribonuclease A, carbonic anhydrase and Gpi were used as molecular 
weight standards. Peptkle concentrations during the separation were -15 jiM as estimated from peak heights moni- 
tored at 275 nm. . ^ ^- ^ 
[0222] N uclear magnetic resonance spectroscopy. Samples were prepared in 90/1 0 HgO/DaO and 50 mM sodium 
phosphate buffer at pH 5.5. Spectra were acquired on a Varian Unityplus 600 MHz spectrometer at 25 'C. Samples 
were approximately 1 mM, except for a70 which had limited solubility (100 jiM). For hydrogen exchange studies, an 
NMR sample was prepared, the pH was adjusted to 5.5 and a spectrum was acquired to serve as an unexchanged 
reference. This sample was lyophilized, reconstituted in DgO and repetitive acquisition of spectra was begun immedi- 
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for rJTut!!^^ I r '^f hours, then the sample was heated to 99 -C 

1 ^ ^ x^"^^ 25 -C. a final spectrum was acquired to sen,e as the 

LtohSi^ rTZ. T ^'^^ " ^" *'«'^«"9««'"« P«ak8 were nomialized by a set of non-exchang ng 

St7JL ^ '«°'°P« measured for all the samples after data acqulsZ 

Ere contnnH " '^^^tr'"'"?"'' 'P""'™ '^"^ '''""^ «^«'*")- »"99estlng that their sSLary 

spec^um im^^^^^^^ 

SZIS •? loss of secondanr structure relative to Gbl . a 1 07 has a spectrum characteristic of a random coil 

S -^l^M H ^"! ^ t ^ ^""^ ^2 -C. respectively, a 107 shows no themial transition, behavior expected from a 
ully unfolded polypeptide, and a 70 has a broad, shallow transition, centered at -40 "6 charactenSi^ oroaZllJ 

the trenS In T 4 denaturat.on measurements of the free energy of unfolding (AGJ at 25 -C match 



rt St^ot"ooL'iW.^nZ'' ^^-illf " T'*^'' ''^^ « 85 is slightly less stable. 
rno«, I? ° "'^"'^ « ^° °'' ''^^^"s® ^liey 'a'^k discernible transitions. 

B^^orol^n'!tl°l S f"^"""" ''^^ '^""^ °^ ^^^^ P~*«'" ««« «=««ssed to gauge 

each proteins degree of native-like character (data not shown), a 90 possesses a highly dispersed soectmm the 

broadened relathre to a 90. suggesting a moderately mobile structure that nevertheless maintains a distinct fo?d a Ws 
Am^S^H H ^ ! "P""''"'" "^"'P "° ^'^P^*'""' ^hich is IndStive of an unfolded protdn 

Tr^Z T'^' """^'"^ °' unexchanged amide protons as a function of time for each of the designed 

2? C ThT^To ? . " '° P"'^*^ =^ "^^^ 20 hours Of exchange at pH 5^nl 

of a^ldror^tl frr?T .s indistinguishable from GpVs (not shown). « 85 also maintains a well-prStected set 
im.TT^ . ""^"^ ^^^'"'^ °^ "'■''^'^ native-like proteins. The number of protected protons, however is 
only about half that of a 90. The difference is likely due to higher flexibility in some parts of the « 86 stSe In 

d nt:S' St. tl? ' '''"''"'^ ''^'^ ^-^«- «^ * ' expele;"ind.Xh^h.; 



[0226] Near UV CD spectra and the extent of 8-anilino-1 -naphthalene sulfonic acid (ANS) bindina were used to 
assess the structural ordering of the proteins. The near UV CD spectra of a85 and a90 havTs^ong SsTeS^ 
or protems wrth aromatic residues fixed in a unique tertiary structure while «70 and a107 have fS.^SsT^^ 
l^^^h ' w '^"^ ^ states or unfSTroSns 

s ronoSr ff ' '"^^'"'^ '"^^^^^^ ^'"^ °f t^^^ ANS emission sp^Tm S 

leslb"e o'^^^^^^^^ that possesses a loosefy packed or partially exposured cluster of hydrophobte r^idu^ 
accessible to ANS. ANS binds a85 weakly, with only a 25% increase in emission intensity, similar to the association 
seen forsome native proteins (Semisotnov. etaJ.. Biopo^ers 31 :119 (1991 )). «90 and aloV 0^50 no chaSreTn ANS 
fluorescence. All of the proteins migrated as monomers during size exclusion chromatography ^ 
Sina GmTZ' " ^^^''-P^^"^^ "^^^^^-''^^ P^^tein by ail crteria. and it is more stable than the naturally 
^mSt T f- P"''""^ ^^'^^"'^ °^ '"'^^^'^'^ hydrophobic surface burial, a 85 is also a stable ordered 
SLhaL hi 9;^^f 7f than cx90. as evidenced by its NMH spectrum and hydrogen excha^ 

speS?^ dIsLlon a disordered collapsed globule: a non-cooperat^,e themial tVansLn. no NMR 

JT^^, f spers on or arnide proton protection, reduced secondary structure content and strong ANS binding a1 07 is 

trendT f Tf'' "i" '""^ '° '"^ '^"^ '^'^^ hydrophobic residues to hold the cJre t<^eSeXclear 
trend is a loss of protern ordering as a decreases below 0 90 luaeuier. ne ciear 

TwS 0 « <?05 "t^tc'^'"'!" ''^'S" experimental data. In regime 

witr?8?a?0 Q n^;Jnn r ^ ""l!^'^'^ P""^'"^ 'P'"'"°'^ ^'="'""9 well-ordered proteins. In regime 2, 
with 0.8 ^ a < 0.9. packing forces are weakened enough to let the hydrophobic force drive larger residues into the core 

K? ^ '"^'"^^ 4. a > 1 .05, the steric forces used to implement packing speciticity^e 

scaled too high to allow reasonable sequence selection and hence produce an unfolded protein. These rS ndTcate 

that effectiveprotendesignrequiresaconsideradon of packing effects. Within theconteL^p^^^^^^^^ 

we have quantitatively defined the range of packing forces necessary for successful designs Atei we h^J dJln ' 

SS^ '"''"'^ '''' *° '^'^'^ P-»^'" alternative pL^^^^^^^ ' 

[0229] To take advantage of the benefits of reduced packing constraints, protein cores should be desianed with the 

smallest a that st.ll resurts in structurally ordered proteins. The optimal prot Jin sequence fro^ rejjme ^ ?aS 
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10 



IS 



20 



25 



30 



35 



40 



45 



SO 



SS 



and well packed suggesting 0.8 ^ a < 0.9 as a good fange. NMR spectra and hydrogen exchange kinetics, however, 
clearly show that o85 is not as structurally ordered as a90. The packing arrangements predicted by our program for 
W43 in o86 and o90 present a possible explanation. For a90, W43 is predicted to pack in the core with the same 
conformation as in the crystal structure of Gpi . In a85, the larger side chains at positions 34 and 54, leucine and 
phenylalanine respectively, compared to alanine and valine In a90, force W43 to expose 91 A* of nonpolar surface 
compared to 19 A2 in a90. The hydrophobic driving force this exposure represents seems likely to stabilize alternate 
conformations that bury W43 and thereby could contribute to oBS's confoimational flexibility (Dill. 1985; Onuchic, ef 
a/ 1996) in contrast to the other core positions, a residue at position 43 can be mostly exposed or mostly buried 
depending on its side-chain confomiation. We designate positions with this characteristic as boundary positions, which 
pose a difficult problem for protein design because of their potential to either strongly interact with the protein's core 

or with solvent. »u j • » 

[0230] A scoring function that penaHzes the exposure of hydrophobic surface area might assist in the design of 
boundary residues. Dili and coworkers used an exposure penalty to improve protein designs in a theoretical study 
fSun. ef a/.. Protein Eng. 8(12)1205-1213 (1995)). 

[0231] A nonpolar exposure penalty would favor packing arrangements that either bury large side chains in the core 
or replace the exposed amino acid with a smaller or more polar one. We implemented a side-chain nonpolar exposure 
penalty in our optimization framework and used a penalizing solvation parameter with the same magnitude as the 

hydrophobe burial parameter. . ^ .. ^ 

[0232] The results of adding a hydrophobic surface exposure penalty to our scoring function are shown in Table 7. 



Table 7. 




[0233] Table 7 depicts the 15 best sequences for the core positions of Gpl using a = 0.85 without an exposure 
penalty. A-o is the exposed nonpolar surface area in A2. 

[0234] When a = 0.85 the nonpolar exposure penalty dramatically alters the ordering of low energy sequences. The 
a85 sequence, the former ground state, drops to 7* and the rest of the 1 5 best sequences expose far less hydrophobic 
area because they bury W43 in a confomiation similar to a90 (model not shown). The exceptions are the B^ and 1 4"' 
sequences, which reduce the size of the exposed boundary residue by replacing W43 with an isoleucine, and the 1 
best sequence which replaces W43 with a valine. The new ground state sequence is very similar to a90. with a single 
valine to isoleucine mutation, and should share a90's stability and structural order. In contrast, when a = 0.90, the 
optimal sequence does not change and the next 14 best sequences, found by IWIonte Carlo sampling, change very 
little This minor effect Is not surprising, since steric forces still dominate for a = 0.90 and most of these sequences 
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expose very Irttle surface area. Burying W43 restricts sequence selection in the core somewhat, butthe reduced packing 
torces for a = 0.85 still produce more sequence variety than a = 0.90. The exposure penalty complements the use of 
reduced packing specificity by limiting the gross overpacking and solvent exposure that occurs when the core's bound- 
ary IS disrupted Adding this constraint should allow lower packing forces to be used in protein design, resulting in a 
broader range of high-scoring sequences and reduced bias from fixed backbone and discrete rotamers. 
[0235] Jo examine the effect of substituting a smaller residue at a boundary position, we synthesized and charac- 
tenzed the 13« best sequence of the a = 0.85 optimization with exposure penalty (Table 8). 

Table 8. 



0=0.85 exposure penalty 



ff 


1 A 

1 np 


TYR 


LEU 


LEU 


1 ALA 


ALA 


PHE 


1 ALA 


VAL 


TRP 


PHE 


VAL 
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20 
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ILE 
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PHE 
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PHE 
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ILE 


ILE 
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PHE 
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29 
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11 
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PHE 
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ILE 








1 


ILE 


1 


TRP 


ILE 
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ILE 
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[0236] Table 8 deptots the 15 best sequences of the core positions of Gpl using a = 0.85 with an exposure penalty. 

IS the exposed nonpolar surface area in A^. r j- 

[0237] This sequence, a85W43V. replaces W43 with a valine but is othenwise identical to aB5. Though the and 
1 4 sequences also have a smaller side chain at position 43. additional changes in their sequences relative to aB5 
would coniplicate interpretation of the effect of the boundary position change. Also, a85W43V has a significantly dif- 
ferent packing arrangement compared to Gp1 . with 7 out of 1 1 positions altered, but only an 8% increase in sideK:hain 
r ■ "^^^"^^ 3 t^st of the tolerance of this fold to a different, but nearly volume conserving, core The 

If,uM u ^P^^*^'" °^ a85W43V is very similar to that of Gpl with an eliipticity at 218 nm of -14000 deg cm2/dmoi 
WhHe the secondary structure content of o85W43V is native-like, its T„ is 65 "C, nearly 20 'C lower than o85 In 
contrast to a85W43V6 decreased stability, its Um spectrum has greater chemical shift dispersion than a85 (data not 
Shown). The amide hydrogen exchange kinetics show a well protected set of about four protons after 20 hours (data 
not shown). This faster exchange relative to o8S is explained by a85W43V'8 signiffcantly lower stabil ity (Mayo & Bald- 
win. 1 993). a85W43V appears to have improved structural specificity at the expense of stability, a phenomenon ob- 
sen/ed previously in coiled colls (Harbury, et al.. 1 993). By using an exposure penalty, the design algorithm produced 
a protein with greater native-like character. k « 

[0238] We have quantitatively defined the role of packing specificity in protein design and have provided practical 
bounds for the role of steric forces in our protein design program. This study differs from previous work because of the 
use of an objective, quantitative program to vary packing forces during design, which allows us to readily apply our 
conclusions to different protein systems. Further, by using the minimum effective level of steric forces, we were able 
to design a wider variety of packing arrangements that were compatible with the ghren fold. Rnally, we have identified 
a difficulty in the design of side chains that lie at the boundary between the core and the surface of a protein and we 
have Implemented a nonpolar surf ace exposure penalty in our sequence design scoring function that addresses this 
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problem. 
Exan^le 5 

Design of a full protein 

[02391 The entire amino acid sequence of a protein motif has been computed. As In Example 4. the second zinc 
finger module of the DNA binding protein Zif268 was selected as the design template. In order to assign the residue 
positions in the template structure Into core, surface or boundary classes, the orientation of the Ca-Cp vectors was 
assessed relative to a solvent accessible surface computed using only the template Co atoms. A solvent accessible 
surface for only the Ca atoms of the target fold was generated using the Connolly algorithm with a probe radius of 8.0 
A a dot density of 1 0 A2 and a Ca radius of 1 .95 A. A residue was classified as a core position if the distance from 
its Ca along Its Ca-Cp vector, to the solvent accessible surface was greater than 5 A. and if the distance from its Cp 
to the nearest surface point was greater than 2.0 A. The remaining residues were classified as surface positions if the 
sum of the distances from their Ca, along their Ca-Cp vector, to the solvent accessible surface plus the distance from 
their CB to the nearest surface point was less than 2.7 A. All remaining residues were classified as boundary positions. 
The classifications for Zif268 were used as computed except that positions 1. 17 and 23 were converted from the 
boundary to the surface class to account for end effects from the proximity of chain termini to these residues in the 
terlary structure and inaccuracies In the assignment. 

[02401 The small size of this motif limits to one (position 5) the n umber of residues that can be assigned unambig u- 
ously to the core while seven residues (positions 3, 7, 12, 18. 21. 22. and 25) were classified as boundary andthe 
remaining 20 residues were assigned to the surface. Interestingly, while three of the zinc binding positions of Zif268 
are In the boundary or core, one residue, position 8, has a Ca-Cp vector directed away from the protein's geometric 
center and Is classified as a surface position. As In our previous studies, the amino acids considered at the core 
positions during sequence selection were A. V. L, I. F. Y. and W; the amino acids considered at the surface positions 
were A S T, H D N E Q K. and R; and the combined core and surface amino acid sets (16 amino acids) were 
considered at the boundary positions. Two of the residue positions (9 and 27) have ^ angles greater than 0» and are 
set to Gly by the sequence selection algorithm to mlnmize backbone strain. 

[02411 The total number of amino acid sequences that must be considered by the design algorithm Is the product of 
the number of possible amino acid types at each residue position. The ppa motif residue classification described above 
results in a virtual combinatorial library of 1 .9 x 1 027 possible amino acid sequences (one core position with 7 possible 
amino acids, 7 boundary positions with 16 possible amino adds. 18 surface positions with 10 possible amino acids 
and 2 positions with ^ angles greater than 0» each with 1 possible amino acid). A corresponding peptide library con- 
sisting of only a single molecule for each 28 residue sequence would have a mass of 11 .6 metnc tons. In order to 
accurately model the geometric specificity of side-chain placement, we explicitly consider the torsional flexibility of 
amino acid side chains in our sequence scoring by representing each amino acid with a discrete set of allowed con- 
formations called rotamers. As above, a backbone dependent rotatmer library was used (Dunbrack and Karplus. su- 
pra), with adjustments In the X, and angles of hydrophobic residues. As a result, the design algorithm must consider 
all rotamers for each possible amino acid at each residue position. The total size of the search space for the ppa rnotif 
is therefore 1.1x1 062 possible rotamer sequences. The rotamer optimization problem for the ppo motif required 90 
CPU hours to find the optimal sequence. .. „ ,»u 

[02421 The optimal sequence, shown in Figurel 1 , is called Full Sequence Design-I (FSD-1 ). Even though all of the 
hydrophillc amino acids were considered at each of the boundary positions, the algorithm selected only nonpolar amino 
acids. The eight core and boundary positions are predicted to fomi a well-packed buried cluster. The Phe side chains 
selected by the algorithm at the zinc binding His positions of Zif268, positions 21 and 25. are over 80% buried and the 
Ala at position 5 is 1 00% buried while the Lys at position 8 is greater than 60% exposed to solvent. The other boundary 
positions demonstratethe strong steric constraints on buried residues by packing similar side chains in an arrangement 
similar to that of Zif268. The calculated optimal configuration for core and boundary resWues bunas -1150 A' of 
nonpolar surface area. On the helix surface, the program positions Asn 14 as a helix N-cap with a hydrogen bond 
between its slden^hain carbonyl oxygen and the backbone amide proton of residue 1 6. The eight charged residues on 
the helix form three pairs of hydrogen bonds, though in ourcoiled coil designs helical surface hydrogen bonds appeared 
to be less important than the overall helix propensity of the sequence (Dahlyat, et al. , Science (1 997)). Positioris 4 and 
11 on the exposed sheet surface were selected to be Thr, one of the best p-sheet fomiing residues (Kim. etal. 1993). 
[02431 Rgure 11 shows the alignment of the sequences for FSD-1 and Zif268. Only 6 of the 28 residues (21%) are 
identical and only 11(39%) are similar. Four of the identities are in the buried cluster, whtoh is consistent with the 
expectation that buried residues are more consen/ed than solvent exposed residues for a given motif (Bowie, et al.. 
Science 247:1306-1310(1990)). A BLAST (Altschul, ef al.. supra) search of the FSD-1 sequence against the non- 
redundant protein sequence database of the National Centerfor Biotechnology Information did not find any zinc finger 
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protein sequences. Further, the BLAST search found only low identity matches of weal< statistical significance to frag- 
ments of various unrelated proteins. The highest identity matches were 10 residues (36%) with p values ranging from 
0.63 - 1 .0. Random 28 residue sequences that consist of amino acids allowed in the^pa position classification described 
above produced similar BLAST search results, with 10 or 11 residue identities (36 - 39%) and p values ranging from 
0.35 - 1 .0, further suggesting that the matches found for FSD-1 are statistically insignificant. The very low identity to 
any l^nown protein sequence demonstrates the novelty of the FSD-1 sequence and underscores that no sequence 
infomiation from any protein motif was used in our sequence scoring function. 

[0244] In order to examine the robustness of the computed sequence, the sequence of FSD-1 was used as the 
starting point of a Monte Carlo simulated annealing run. The Monte Carlo search finds high scoring, suboptimal se- 
quences in the neighborhood of the optimal solution (Dahiyat, et aL, (1996) (supra)). The energy spread from the 
ground-state solution to the 1000«i most stable sequence is about 5 kcal/mol Indicating that the density of states Is 
high. The amino acids comprising the core of the molecule, with the exception of position 7, are essentially invariant 
(Figure 11). Almost all of the sequence variation occurs at surface positions, and typically involves consen/ative chang- 
es. Asn 14. which is predicted to fomn a helix N-cap, is among the most conserved surface positions. The strong 
sequence conservation observed for critical areas of the molecule suggests that if a representative sequence folds 
into the design target structure, then perhaps thousands of sequences whose variations do not disrupt the critical 
interactions may be equally competent. Even if billions of sequences would successfully achieve the target fold, they 
would represent only a vanishingly small proportion of the 10^7 possible sequences. 

[0245] Experimental validation. FSD-1 was synthesized in order to characterize its structure and assess the per- 
fomnance of the design algorithm. The far UV circular dichroism (CD) spectmm of FSD-1 shows minima at 220 nm and 
207 nm, which is indicative of a folded structure (data not shown). The thermal melt is weakly cooperative, with an 
inflection point at 39 ^'C, and is completely reversible (data not shown). The broad melt is consistent with a low enthalpy 
of folding which is expectedforamotif with a small hydrophobic core. This behavior contrasts the uncooperative themial 
unfolding transitions observed for other folded short peptides (Scholtz, et a/., 1991). FSD-1 is highly soluble (greater 
than 3 mM) and equilibrium sedimentation studies at 100 jiM, 500 and 1 mM show the protein to be monomerlc. 
The sedimentation data fit well to a single species, monomer model with a molecular mass of 3630 at 1 mM, in good 
agreement with the calculated monomer mass of 3488. Also, far UV CD spectra showed no concentration dependence 
from 50 \lM to 2 mM, and nuclear magnetic resonance (NMR) COSY spectra taken at 1 00 fiM and 2 mM were essentially 
identical. 

[0246] The solution structure of FSD-1 was solved using homonuclear 2D ''H NMR spectroscopy (Piantini, et al., 
1982). NMR spectra were well dispersed indicating an ordered protein structure and easing resonance assignments. 
Proton chemical shift assignments were determined with standard homonuclear methods (Wuthrich, 1 986). Unambig- 
uous sequential and short-range NOEs indicate helical secondary structure from residues 15 to 26 in agreement with 
the design target. 

[0247] The stmcture of FSD-1 was detemnined using 284 experimental restraints (10.1 restraints per residue) that 
were non-redundant with covalent structure including 274 NOE distance restraints and 10 hydrogen bond restraints 
involving slowly exchanging amide protons. Structure calculations were perf omned using X-PLOR (Brunger, 1 992) with 
standard protocols for hybrid distance geometry-simulated annealing (Nilges, etaL, FEBS Lett. 229:317 (1988)). An 
ensemble of 41 structures converged with good covalent geometry and no distance restraint violations greater than 
0.3 A (Table 9). 



Table 9. 



NMR structure determination: distance restraints, structural statistics and atomic root-mean-square (nns) 
deviations. <SA> are the 41 simulated annealing structures, SA Is the average structure before energy minimization, 
(SA)n is the restrained energy minimized average structure, and SD is the standard deviation. 


Distance restraints 


Intraresidue 




97 




Sequential 




83 




Short range (li-jl = 2-5 
residues) 




59 




Long range (li-jl > 5 
residues) 




35 




Hydrogen bond 




10 




Total 




284 
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Table 9. (continued) 



NMR structure determination: distance restraints, structural statistics and atomic root-mean-square (rms) 
deviations. <SA> are the 41 simulated annealing structures, S A is the average structure before energy minimization, 
(SA)r is the restrained energy minimized average structure, and SD Is the standard deviation. 


Structural statistics 




<SA> ± SD 




(SA)r 


Rms deviation from 
distance restraints (A) 


0.043 ± 0.003 




0.038 


Rms deviation from 
idealized geometry 

Bonds (A) 
Angles 

(degrees) 

Impropers 
(degrees) 


0.0041 ± 0.0002 
0.67 ± 0.02 

0.53 ± 0.05 




0.0037 
0.65 

0.51 


Atomic rms deviations (A)* 




<SA> vs. SA ± SD 




<SA> vs. (SA)p, ± SD 


Backbone 


0.54 ± 0.15 




0.69 ±0.1 6 


Backbone 4- nonpolar side 
chalnst 


0.99 ±0.17 




1.16±0.18 


Heavy atoms 


1 .43 ± 0.20 




1.90 ±0.29 



'Atomic rms deviations are for residues 3 to 26, Inclush/e. Residues 1 , 2, 27 and 28 were disordered v angular order parameters (34) < 0.78) 
and had only sequential and li-JI = 2 NOEs. ^Nonpolar side chains are from residues 3, 5, 7, 12, 18, 21. 22, and 25 whicti consitute ttie core of the 
protein. 



[0248] The backbone of FSD-1 Is well defined with a root-mean-square (mis) deviation from the mean of 0.54 A 
(residues 3-26). Considering the burled side chains (residues 3, 5, 7 J 2, 1 8, 21 , 22, and 25) in addition to the backbone 
gives an rms deviation of 0.99 A, indicating that the core of the molecule is well ordered. The stereochemical quality 
of the ensemble of structures was examined using PROCHECK (Laskowski, etal., J. Appl. Crystallogr. 26:283 (1 993)). 
Not including the disordered tennlnl and the glycine residues, 87% of the residues fall in the most favored region and 
the remainder In the allowed region of y space. Modest heterogeneity Is present in the first strand (residues 3-6) 
which has an average backbone angular order parameter (Hyberts, et ai, 1 992) of <S> = 0.96 ± 0.04 compared to the 
second strand (residues 9-12) with an <S> = 0.98 ± 0.02 and the helix (residues 15-26) with an <S> = 0.99 ± 0.01 . 
Overall, FSD-1 is notably well ordered and, to our knowledge, Is the shortest sequence consisting entirely of naturally 
occurring amino acids that folds to a unique structure without metal binding, ollgomerization or disulfide bond formation 
(McKnight, efa/., 1997), 

[0249] The packing pattern of the hydrophobic core of the NMR structure ensemble of FSD-1 (Tyr 3, lie 7, Phe 12, 
Leu 1 8, Phe 21 , lie 22, and Phe 25) Is similar to the computed packing arrangement. Five of the seven residues have 
angles In the same gauche-*^, gauche- or trans category as the design target, and three residues match both and 
Xg angles. The two residues that do not match their computed X^ angles are fie 7 and Phe 25, which Is consistent with 
their location at the less constrained, open end of the molecule. Ala 5 is not Involved in Its expected extensive packing 
interactions and instead exposes about 45% of its surface area because of the displacement of the strand 1 backbone 
relative to the design template. Conversely, Lys 8 behaves as predicted by the algorithm with its solvent exposure 
(60%) and X^ and X2 angles matching the computed structure. Most of the solvent exposed residues are disordered 
which precludes examination of the predicted surface residue hydrogen bonds. Asn 14, however, fonns a helix N-cap 
from its sidechain cariDonyl oxygen as predicted, but to the amide of Glu 17, not Lys 16 as expected from the design. 
This hydrogen bond is present In 95% of the structure ensemble and has a donor-acceptor distance of 2.6 ± 0.06 A. 
In general, the side chains of FSD-1 con'espond well with the design program predictions. 

[0250] A comparison of the average restrained minimized structure of FSD-1 and the design target was done (data 
not shown). The overall backbone rms deviation of FSD-1 from the design target Is 1 .98 A for residues 3-26 and only 
0.98 A for residues 8-26 (Table 1 0). 
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Table 10. 



Comparison of the FSD-1 experimentally determined structure and the design target structure. The FSD-1 structure 
is the restrained energy minimized average from the NMR structure determination. The design target structure is 
the second DNA binding module of the zinc finger Zlf268 (9). 


Atomic rms deviations (A) 


Backbone, residues 3-26 




1.98 




Bacicbone, residues 8*26 




0.98 




Super-secondary structure parameter^ 




FSD-1 




Design Target 


h(k) 


9.9 




8.9 


6(degrees} 


14.2 




16.5 


n(degrees) 


. 13.1 




13.5 



•/I, e, fl are calculated as previously described (36, 37). b is the distance between the centrold of the helix Ca coordinates (residues 15-26) and the 
least-square plane fit to the Ca coordinates ot the sheet (residues 3-12. 6 Is the angle of inclination of the principal moment of the helix Ca atoms 
with the plane of the sheet a Is the angle between the projection of the principal moment of the helix onto the sheet and the projection of the average 
least-square fit ine to the strand Ca coordinates (residues 3-6 and 9*12) onto the sheet. 



[0251] The largest difference between FSD-1 and the target structure occurs from residues 4-7, with a displacement 
of 3.0-3.5 A of the backbone atom positions of strand 1 . The agreement for strand 2, the strand to helix turn, and the 
helix is remarkable, with the differences nearly within the accuracy of the structure detennlnation. For this region of 
the structure, the mis difference of ^,\\f angles between FSD-1 and the design target Is only 14 ± 9**. In order to quan- 
titatively assess the similarity of FSD-1 to the global fold of the target, we calculated their supersecondary structure 
parameters fTable 9) (Janin & Chothia, J. MoL Biol. 143:95 (1980); Su & Mayo, Protein Sd. In press, 1997). which 
describe the relative orientations of secondary structure units in proteins. The values of 9, the inclination of the helix 
relative to the sheet, and i2, the dihedral angle between the helix axis and the strand axes, are nearly identical. The 
height of the helix above the sheet, h, Is only 1 A greater in FSD-1 . A study of protein core design as a function of helix 
height for Gbl variants demonstrated that up to 1 .5 A variation in helix height has little effect on sequence selection 
(Su & Mayo, supra, 1 997). The comparison of secondary structure parameter values and backbone coordinates high- 
lights the excellent agreement between the experimentally determined structure of FSD-1 and the design target, and 
demonstrates the success of our algorithm at computing a sequence for this ppa motif. 

[0252] The quality of the match between FSD-1 and the design target demonstrates the ability of our program to 
design a sequence for a fold that contains the three major secondary stmcture elements of proteins: sheet, helix, and 
turn. Since the ppa fold is different from those used to develop the sequence selection methodology, the design of 
FSD-1 represents a successful transfer of our program to a new motif. 

Example 6 

♦ 

Calculation of solvent accessible surface area scaling factors 

[0253] In contrast to the previous work, backbone atoms are included in the calculation of surface areas. Thus, the 
calculation of the scaling factors proceeds as follows. 

[0254] The program BIOGRAF (Molecular Simulations Incorporated, San Diego, California) was used to generate 
explicit hydrogens on the structures which were then conjugate gradient minimized for 50 steps using the DREIDiNG 
force field. Surface areas were calculated using the Connolly algorithm with a dot density of 10 A-2, using a probe 
radius of zero and an add-on radius of 1 .4 A and atomic radii from the DREIDING force-field. Atoms that contribute to 
the hydrophobic surface area are carbon, sulfur and hydrogen atoms attached to carbon and sulfur. 
[0255] For each side-chain rotamer rat residue position / with a local tri-peptide backbone t3, we calculated A 0,^3 
the exposed area of the rotamer and its backbone in the presence of the local tri-peptide backbone, and >*/^the exposed 
area of the rotamer and its backbone in the presence of the entire template r which includes the protein backbone and 
any side-chains not Involved In the calculation (Rgure 1 3). The difference between A^f^, and Af^ls the total area burled 
by the template for a rotamer r at residue position /. For each pair of residue positions /and j and rotamers r and s on 
/ and j, respectively. Aj^^ the exposed area of the rotamer pair in the presence of the entire template, is calculated. The 
difference between Afj^ and the sum of Af^ and Aj^ is the area buried between residues / and / excluding that area by 
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the template. The pairwise approximation to the total burled surface area is: 



Equation 29: 



[0256] As shown in Figure 1 3, the second sum in Equation 29 over-counts the buried area. We have therefore mul- 
tiplied thesecond sum by ascale factor f whose value is to be determined empirically. Expected values of f are discussed 
below. 

[0257] Noting that the burled and exposed areas should add to the total area, Z^^/^a, the solvent-exposed surface 
area is: 



Equation 30: 

[0258] The first sum of Equation 30 represents the total exposed area of each rotamer in the context of the protein 
template Ignoring Interactions with other rotamers. The second sum of Equation 30 subtracts the burled areas between 
rotamers and is scaled by the same parameter f as in Equation 29. 

[0259] Some insight into the expected value of f can be gained from consideration of a close-packed face centered 
cubic lattice of spheres or radius r. When the radii are Increased from rto R, the surface area on one sphere buried 
by a neighboring sphere Is 2nR(R - /). We take rto be a carbon radius (1 .96 A), and Is 1 ,4 A larger. Then, using: 

true buried area 
~ pairwise buried area 

and noting that each sphere has 12 neighbors, results In: 

2x271 /?(«-r) 

[0260] This yields f = 0.40. A close-packed face centered cubic lattice has a packing fraction of 74%. Protein interiors 
have a similar packing fraction, although because many atoms are covalently bonded the close packing is exaggerated. 
Therefore this value of f should be a lower bound for real protein cores. For non-core residues, where the packing 
fraction is lower, a somewhat larger value of f Is expected. 

[0261] We classified residues from ten proteins ranging in size from 54 to 2B9 residues into core or non-core as 
follows. We classified resides as core or non-core using an algorithm that considered the direction of each side-chain's 
Ca-cp vector relative to the surface computed using only the template Ca atoms with a carbon radius of 1 .95 A, a 
probe radius of 8 A and no add-on radius. A residue was classified as a core position if both the distance from its Ca 
atom (along its Ca-Cp vector) to the surface was greater than 5.0 A and the distance from Its Cp atom to the nearest 
point on the surface was greater than 2.0 A. The advantage of such an algorithm Is that a knowledge of the amino acid 
type actually present at each residue position is not necessary. The proteins were as shown in Table I, showing selected 
proteins, total number of residues and the number of residues in the core and non-core of each protein (Gly and pro 
were not considered). 



Brookhaven Identifier 


Total Size 


Core Size 


Non-Core Size 


1enh 


54 


10 


40 


1pga 


56 


10 


40 
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(continued) 



Brookhaven Identifier 


Total Size 


Core Size 


Non-Core Size 


lubi 


76 


16 


50 


1mol 


94 


19 


61 


IKpt 


105 


27 


60 


4azu-A 


128 


39 


71 


igpr 


158 


39 


89 


1gcs 


174 


53 


98 


ledt 


266 


95 


133 


1pbn 


289 


96 


143 



[0282] The classification into core and non-core was made because core residues interact more strongly with one 
another than do non-core residues. This leads to greater over-counting of the buried surface area for core residues. 
[0263] Considering the core and non-core cases separately, the value of f which most closely reproduced the true 
Lee and Richards surface areas was calculated for the ten proteins. The painwise approximation very closely matches 
the true burled surface area (data not shown). It also perfonms very well for the exposed hydrophobic surface area of 
non-core residues (data not shown). The calculation of the exposed surface area of the entire core of a protein Involves 
the difference of two large and nearly equal areas and Is less accurate; as will be shown, however, when there Is a 
mixture of core and non-core residues, a high accuracy can still be achieved. These calculations indicate that for core 
residues f is 0.42 and for non-core residues f Is 0.79. 

[0264] To test whether the classification of residues Into core and non-core was sufficient, we examined subsets of 
interacting residues in the core and non-core positions, and compared the true burled area of each subset with that 
calculated (using the above values of f). For both subsets of the core and the non-core, the congelation remained high 
(/? = 1 .00) Indicating that no further classification Is necessary (data not shown). (Subsets were generated as follows: 
given a seed residue, a subset of size two was generated by adding the closest residue: the next closest residue was 
added for a subset of size three, and this was repeated up to the size of the protein. Additional subsets were generated 
by selecting different seed residues.) 

[0265] It remains to apply this approach to calculating the buried or exposed surface areas of an arbitrary selection 
of interacting core and non-core residues in a protein. When a core residue and a non-core residue interact, we replace 
Equation 29 with: 

Equation 31: 



and Equation 30 with Equation 32: 



Abased "^A,^,- S (£^A^^^-^f^A^^,-f,jA,^^^) 



where f, and y are the values of f appropriate for residues / and y, respectively, and f^^ takes on an intermediate value. 
Using subsets from the whole of 1pga, the optimal value of f^was found to be 0.74. This value was then shown to be 
. appropriate for other test proteins (data not shown). 



Claims 



1 . A method executed by a computer under the control of a program, said computer including a memory for storing 
said program, said method comprising the steps of: 
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(A) receiving a protein backbone structure with variable residue positions; 

(B) selecting a collection of variable positions; 

(C) establlsiiing a group of potential rotamers for each of said variable residue positions, and wherein a first 
group for a first variable position has a first set of rotamers from at least two different amino acid side chainSi 
and wherein a second group for a second variable position has a second set of rotamers from at least two 
different amino acid side chains, and wherein said sets are different; and, 

(D) analyzing the interaction of each of said rotamers in each group with all or part of the remainder of said 
protein to generate a set of optimized protein sequences. 

A method executed by a computer under the control of a program, said computer Including a memory for storing 
said program, sajd method comprising the steps of: 

(A) receiving a protein backbone structure with variable residue positions; 

(B) selecting a collection of variable positions; 

(C) establishing a group of potential amino acids for each of said variable residue positions, wherein a first 
group for a first variable position has a first set of at least two amino acid side chains, and wherein a second 
group for a second variable position has a second set of at least two different amino acid side chains, and 
wherein said sets are different; and; 

(D) analyzing the Interaction of each of said amino acids with all or part of the remainder of said protein to 
generate a set of optimized protein sequences. 

A method according to claim 1 or claim 2, wherein said analyzing step utilizes a set of at least one scoring function. 

A method executed by a computer under the control of a program, said computer including a memory for storing 
said program, said method comprising the steps of: 

(A) receiving a protein backbone structure with variable residue positions; 

(B) selecting a set of variable residue positions; 

(C) establishing a group of potential rotamers for each of said variable residue positions; and, 

(D) analyzing the Interaction of each of said rotamers with all or part of the remainder of said protein to generate 
a set of optimized protein sequences, wherein said analyzing step includes the use of at least one scoring 
function, and wherein for at least two positions a different set of scoring functions is utilized. 

A method according to any one of the preceding claims, further comprising classifying each variable residue po- • 
sition as either a core, surface or boundary residue. 

A method executed by a computer under the control of a program, said computer Including a memory for storing 
said program, said method comprising the steps of: 

A) receiving a protein backbone structure with variable residue positions; 

(B) classifying each variable position as either a core, surface or boundary position; 

(C) establishing a group of potential amino acids for each of said variable residue positions, wherein the group 
for at least one variable residue position has at least two different amino acid side chains; and, 

(D) analyzing the Interaction of each of said amino acids with ail or part of the remainder of said protein to 
generate a set of optimized protein sequences, wherein said analyzing step Includes the use of at least one 
scoring function. 

A method executed by a computer under the control of a program, said computer including a memory for storing 
said program, said method connprising the steps of: 

(A) receiving a protein backbone structure with variable residue positions; 

(B) selecting a set of variable positions; 

(C) establishing a group of potential rotamers for each of said variable residue positions, wherein the group 
for at least one variable residue position has rotamers of at least two different amino acid side chains, and 
wherein at least one of said amino acid side chains is from a hydrophillc amino acid and, 

(D) analyzing the interaction of each of said rotamers with all or part of the remainder of said protein to generate 
a set of optimized protein sequences, wherein said analyzing step includes the use of at least one scoring 
function. 
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8. A method according to ciaim 7, wherein said hydrophilic amino acid is selected from tlie group consisting of serine, 
threonine, aspartic acid, asparagine, glutamine, glutamic acid, arglnine, lysine, and histidine. 

9. A method according to any one of the preceding claims, further comprising physically generating at least one 
member of said set of optimized protein sequences and experimentally testing said sequence for a desired function. 

10. A method according to any one of the preceding claims, wherein at least two of said set of optimized protein 
sequences comprises at least two sequences. 

1 1 . A method according to any one of the preceding claims, wherein said set of optimized protein sequences comprises 
at least 10^^ protein sequences. 

12. A method according to any one of the preceding claims, wherein said set of optimized protein sequences comprises 
at least 1 0^ protein sequences. 

13. A method according to any one of the preceding claims, wherein said set of optimized proteins sequences com- 
prises at least 10^ sequences. 

14. A method executed by a computer under the control of a program, said computer including a memory for storing 
said program, said method comprising the steps of: 

(A) receiving a protein bacl<bone structure with variable residue positions; 

(B) selecting a collection of variable positions; 

(C) establishing a group of potential amino acids for each of said variable residue positions, wherein a first 
group for a first variable position has a first set of at least two amino acid side chains, and wherein a second 
group for a second variable position has a second set of at least two different amino acid side chains, and 
wherein said sets are different; and; 

(D) analyzing the interaction of each of said amino acids with ail or part of the remainder of said protein to 
generate a set of optimized protein sequences, wherein said analyzing step includes the use of a side chain 
confonnation dependent scoring function. 

15. A method according to any one of the preceding claims, further comprising a side chain confonDation dependent 
scoring function. 

16. A method according to any one of the preceding claims, wherein said scoring functions are scaled using a scaling 
factor, 

1 7. A method according to any one of the preceding claims, wherein said analyzing step comprises a DEE computation. 

1 8. A method according to any one of the preceding claims, wherein said set of optimized protein sequences comprises 
the globally optimal protein sequence. 

19. A method according to claim 17, wherein said DEE computation is selected from the group consisting of original 
DEE and Goldstein DEE. 

20. A method according to any one of the preceding claims, wherein said set of scoring functions includes a scoring 
function is selected from the group consisting of a van der Waals potential scoring function a hydrogen bond 
potential scoring function, an atomic solvation scoring function, an electrostatic scoring function and a secondary 
stmcture propensity scoring function. 

21. A method according to any one of the preceding claims, wherein said set of scoring functions Includes at least 
three scoring functions. 

22. A method according to any one of the preceding claims, wherein said set of scoring functions includes at least 
four scoring functions. 
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